INDEX
Explanations
references to surveillance and control by authority figures
NEGATIVE LOGITS
IDEOS       -0.16
onaut       -0.16
ona         -0.15
ÑĢой        -0.15
pedo        -0.14
adj         -0.14
canonical   -0.14
LGBTQ       -0.14
iffer       -0.14
RI          -0.14
POSITIVE LOGITS
AZY         0.17
symbol      0.17
Symbol      0.17
symbol      0.16
(symbol     0.15
blinded     0.15
-symbol     0.15
ymbol       0.15
sublic      0.14
.Symbol     0.14
Activations Density: 0.021%