INDEX
Explanations
locations or places
references to locations or places
Negative Logits
Token      Logit
rd         -0.75
arb        -0.75
agascar    -0.70
unes       -0.67
CHAT       -0.66
olyn       -0.65
onday      -0.64
umin       -0.63
icial      -0.63
luaj       -0.63
Positive Logits
Token      Logit
holder     1.17
holders    1.13
bos        0.99
upon       0.81
ername     0.77
liness     0.75
knife      0.74
holder     0.74
bo         0.73
ngth       0.72
Activations Density: 0.043%