INDEX
Explanations
phrases associated with political positions and actions
Negative Logits
Token        Logit
aeda         -0.16
ustralian    -0.15
abic         -0.15
erdem        -0.15
amo          -0.15
olen         -0.14
ileged       -0.14
åIJ          -0.14
fak          -0.14
odied        -0.14
Positive Logits
Token        Logit
intent       0.22
hell         0.20
determined   0.19
intent       0.19
caught       0.19
tone         0.18
bent         0.18
allergic     0.18
toast        0.18
boxed        0.17
Activation Density: 0.169%
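For readers unfamiliar with the density figure, the sketch below shows one common way such a number is computed: the fraction of token positions on which the feature's activation is above zero. This definition is an assumption, not something the page states, and the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The page itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 2 of which activate the feature.
toy_activations = [0.0] * 998 + [1.3, 0.4]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.169% would mean the feature fires on roughly 17 of every 10,000 token positions.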