INDEX
Explanations
references to political figures and their actions or statements
Negative Logits
InBackground       -0.18
uada               -0.15
GUIDE              -0.15
Reporting          -0.14
çIJ³               -0.14
_representation    -0.14
.qual              -0.14
VICE               -0.13
isyon              -0.13
arem               -0.13
Positive Logits
piv                0.18
preview            0.18
λÏī                0.16
jab                0.16
duck               0.16
malign             0.15
riff               0.15
again              0.15
spar               0.15
deliver            0.15
Activations Density: 0.110%