INDEX
Explanations
instances of collective or group-related language
Negative Logits
_OW       -0.17
isd       -0.16
pty       -0.15
оть       -0.14
arken     -0.14
treff     -0.14
parallel  -0.14
zers      -0.14
inis      -0.14
asurer    -0.14
Positive Logits
alike       0.18
their       0.17
tall        0.16
themselves  0.15
family      0.15
their       0.15
erst        0.15
Tall        0.15
loved       0.14
families    0.14
Activations Density: 0.133%