INDEX
Explanations
words related to negative behavior such as abusive, rude, derogatory, and hateful
language related to abusive or harmful behavior
Negative Logits
oleon            -0.93
obyl             -0.92
zzo              -0.88
DragonMagazine   -0.85
igham            -0.85
Downloadha       -0.84
zig              -0.83
ortal            -0.83
ariat            -0.82
akeru            -0.82
Positive Logits
behav            1.06
behaviour        0.94
abusive          0.86
behavior         0.85
soever           0.83
alien            0.82
distractions     0.80
aspects          0.80
undermin         0.79
slurs            0.78
Activations Density 0.032%