INDEX
Explanations
terms related to political activities and negative actions
Negative Logits
»Ĵ          -0.90
anamo       -0.79
aido        -0.79
acers       -0.78
sidx        -0.77
izont       -0.77
undreds     -0.76
enz         -0.76
laws        -0.75
aughters    -0.74
Positive Logits
perpetrated  1.09
concoct      1.05
aimed        1.03
unworthy     1.00
ploy         0.99
rather       0.94
intended     0.90
gimmick      0.90
meant        0.89
akin         0.89
Activations Density: 0.272%