INDEX
Explanations
instances of strong profanity and derogatory language
Negative Logits
nodoc     -0.15
sem       -0.15
Ì         -0.14
emens     -0.14
-kit      -0.14
ozem      -0.14
812       -0.14
zure      -0.14
.diag     -0.14
spec      -0.14
Positive Logits
abbo      0.16
ason      0.15
ppo       0.15
endale    0.15
Kidd      0.14
mue       0.14
eniable   0.14
Carlson   0.14
634       0.13
erguson   0.13
Activations Density: 0.027%