Explanations
references to specific individuals or characters in various contexts
Negative Logits
himself   -0.18
Himself   -0.18
unga      -0.18
opr       -0.16
.libs     -0.15
ерж       -0.14
nào       -0.14
sám       -0.14
ungi      -0.14
ní        -0.14
Positive Logits
alike         0.42
respectively  0.33
both          0.28
themselves    0.28
BOTH          0.25
both          0.25
sowie         0.24
각각          0.24
serta         0.23
their         0.23
Activations Density 0.236%