INDEX
Explanations
expressions of fandom or interest related to conspiracy theories and specific groups of people
Negative Logits
erval        -0.16
Surveillance -0.15
uest         -0.15
ozy          -0.14
Rough        -0.14
ZERO         -0.14
ilar         -0.14
мова         -0.14
tearDown     -0.13
OTHERWISE    -0.13
Positive Logits
.hp          0.16
axon         0.16
aver         0.15
bern         0.14
curs         0.14
iyat         0.14
особ         0.14
inha         0.13
romance      0.13
look         0.13
Activations Density: 0.115%