Explanations
references to friendships and interpersonal relationships
Negative Logits
were        -0.17
were        -0.17
uzzer       -0.17
weren       -0.16
Were        -0.15
بÙĪØ¯ÙĨد     -0.15
Were        -0.14
ómo         -0.14
IDER        -0.14
(tol        -0.14
Positive Logits
ist         0.40
kommt       0.38
hat         0.38
wird        0.37
steht       0.35
stellt      0.35
lässt       0.34
bleibt      0.34
liegt       0.33
geht        0.33
Activations Density: 0.048%