INDEX
Explanations
words related to controversial political figures and actions
references to torture and related historical figures
Negative Logits
Ü          -0.93
âĸ¬        -0.85
ä          -0.82
à          -0.80
liest      -0.78
UAL        -0.77
ãĤ·        -0.73
Minecraft  -0.72
RGB        -0.71
ãĥī        -0.71
Positive Logits
Bolton     0.84
Tort       0.82
ramid      0.81
ongyang    0.81
ombo       0.78
kefeller   0.77
artisan    0.77
terness    0.76
odies      0.75
oreal      0.75
Activations Density: 0.023%