Explanations
offensive language and derogatory terms
Negative Logits
kinson     -0.17
ิร         -0.15
zik        -0.14
ners       -0.14
urve       -0.14
ecided     -0.14
zel        -0.14
/plugin    -0.14
mv         -0.13
ÑĥÑĩ       -0.13
Positive Logits
YLE        0.15
ê¶Į        0.14
edd        0.14
oten       0.14
umb        0.14
emouth     0.13
aroo       0.13
franca     0.13
elen       0.13
_OPTS      0.13
Activations Density: 0.028%