INDEX
Explanations
references to political opposition and public criticism
Negative Logits
Token        Logit
ktop         -0.17
yna          -0.16
宣           -0.16
amed         -0.14
ripp         -0.14
itchen       -0.13
ennes        -0.13
ien          -0.13
nnen         -0.13
pronounce    -0.13
Positive Logits
Token        Logit
Privacy      0.17
Privacy      0.16
ohana        0.16
privacy      0.16
urum         0.15
squ          0.15
groups       0.14
à¸ı          0.14
some         0.13
Trim         0.13
Activations Density 0.119%