Explanations
phrases or expressions related to resistance or dissent
Negative Logits
Token       Logit
robe        -0.16
esz         -0.16
utsch       -0.15
iley        -0.15
Aws         -0.15
bbe         -0.15
Copyright   -0.15
GIN         -0.14
Smarty      -0.14
女子        -0.14
Positive Logits
Token       Logit
Newman      0.15
Lam         0.15
/modal      0.14
lam         0.14
toll        0.14
outs        0.14
(           0.14
apur        0.13
isma        0.13
variable    0.13
Activations Density: 0.046%
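
As a rough illustration of what the density figure measures, the sketch below assumes that activation density is the fraction of sampled tokens on which the feature's activation is nonzero. The function name `activation_density` and the toy activation values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (assumed): 100,000 tokens, 46 of which activate the feature.
toy_activations = [0.0] * 99_954 + [1.5] * 46
density = activation_density(toy_activations)
print(f"{density:.3%}")  # 0.046%
```

Under this assumption, a density of 0.046% corresponds to roughly 46 activating tokens per 100,000 sampled.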