INDEX
Explanations
references to public figures or entities involved in political discussions
Negative Logits
ifo      -0.16
pard     -0.16
azz      -0.15
borrow   -0.14
aw       -0.14
oom      -0.14
ue       -0.14
Äĥm      -0.14
gly      -0.13
Crack    -0.13
Positive Logits
andro          0.18
duto           0.15
&E             0.15
resher         0.15
olv            0.15
getSingleton   0.14
cen            0.14
XR             0.14
ornings        0.14
etro           0.14
Activations Density 0.063%