INDEX
Explanations
instances of the word "other" and its variations
Negative Logits
ven      -0.18
ryn      -0.17
koli     -0.16
utan     -0.16
ycz      -0.16
fty      -0.16
ray      -0.15
ilar     -0.15
rong     -0.15
rames    -0.14
Positive Logits
world    0.28
ewise    0.26
wis      0.23
word     0.23
than     0.23
_than    0.22
ness     0.21
-than    0.21
-world   0.20
ullo     0.20
Activations Density 0.080%