INDEX
Explanations
expressions of hatred or strong negative sentiments towards individuals or groups
Negative Logits
/latest           -0.17
691               -0.16
AdapterFactory    -0.16
mates             -0.15
ddit              -0.15
ces               -0.15
leanup            -0.15
uckle             -0.14
oleans            -0.14
gency             -0.14
Positive Logits
irl               0.15
GLE               0.14
rus               0.14
/env              0.14
rypto             0.14
AKE               0.14
yne               0.14
is                0.14
ży                0.14
mis               0.13
Activations Density 0.107%