INDEX
Explanations
words related to negative or harmful behavior or actions, such as abusive, deceptive, and oppressive
terms associated with abusive or harmful behaviors and practices
Negative Logits
ild           -0.92
igating       -0.87
inen          -0.86
igated        -0.85
ighed         -0.85
igate         -0.84
osal          -0.83
oleon         -0.83
izen          -0.82
Downloadha    -0.80
Positive Logits
abusive          1.26
citiz            0.88
behav            0.83
behaviour        0.83
oppressive       0.81
undermin         0.78
volent           0.77
tendencies       0.77
discriminatory   0.77
minded           0.75
Activations Density: 0.021%