INDEX
Explanations
expressions of insult or derogatory remarks
Negative Logits
hin       -0.17
лиÑĨ      -0.16
šet       -0.16
overy     -0.16
UIFont    -0.15
夫        -0.15
eka       -0.15
reno      -0.15
ahat      -0.15
rze       -0.14
Positive Logits
ably      0.16
nit       0.15
alla      0.15
berman    0.14
odate     0.14
Morrison  0.14
peria     0.14
ÎĨ        0.13
breaker   0.13
/api      0.13
Activations Density 0.003%