Explanations
expressions of mutual support and connection between individuals
Negative Logits
itself      -0.20
_OTHER      -0.17
ug          -0.16
etur        -0.16
together    -0.16
zusammen    -0.14
otherwise   -0.14
arn         -0.14
furt        -0.14
ablo        -0.14
Positive Logits
hood        0.20
nhau        0.17
/us         0.16
elves       0.15
-même       0.15
ieron       0.14
/all        0.14
mutually    0.14
's          0.14
across      0.14
Activations Density 0.019%