INDEX
Explanations
terms related to political figures and locations
occurrences of the word "no."
Negative Logits
Token      Logit
RAFT       -0.83
lycer      -0.76
aven       -0.74
assies     -0.73
rosse      -0.68
schild     -0.68
irlf       -0.67
tein       -0.65
rican      -0.65
iership    -0.64
Positive Logits
Token      Logit
zzle       1.15
etheless   1.14
terday     1.08
xious      0.98
obs        0.93
except     0.89
ct         0.85
longer     0.84
ise        0.83
ises       0.77
Activations Density: 0.026%