INDEX
Explanations
various forms of offensive language and profanity
Negative Logits
dale       -0.15
kinson     -0.15
cli        -0.15
âĦ         -0.15
/plugin    -0.14
yang       -0.14
arden      -0.14
ullo       -0.14
zel        -0.14
ernal      -0.14
Positive Logits
edd        0.14
éĦ         0.13
oten       0.13
Reef       0.13
YLE        0.13
elen       0.13
hur        0.13
룡         0.13
ihn        0.13
Duy        0.13
Activations Density: 0.019%