INDEX
Explanations
statements involving controversial or sensitive political topics
Negative Logits
èļ        -0.17
TION      -0.16
alar      -0.15
Slut      -0.14
legates   -0.14
ault      -0.14
Britt     -0.14
oms       -0.14
cheon     -0.14
ura       -0.14
Positive Logits
ench      0.15
Johns     0.14
Portal    0.14
ogonal    0.14
erver     0.13
Sources   0.13
dater     0.13
Hoe       0.13
enna      0.13
íĥĪ       0.13
Activations Density 0.031%