Explanations
negative or harmful language, such as derogatory or inappropriate terms
terms related to negative or harmful content and behavior
Negative Logits
tested     -0.86
rage       -0.86
emis       -0.82
hung       -0.81
united     -0.81
wright     -0.80
lite       -0.80
abiding    -0.79
winning    -0.78
bender     -0.78
Positive Logits
behavior      1.13
materials     1.13
material      1.12
behaviour     1.12
activities    1.11
activity      1.10
conduct       1.06
behaviors     1.06
situations    1.03
items         1.02
Activations Density 0.267%
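
As a rough illustration of what a density figure like the one above usually measures, the sketch below computes the fraction of tokens on which a feature fires. It assumes density means "share of sampled tokens with activation above a threshold"; the page itself does not state its definition. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of values strictly above `threshold`.

    Assumption: "density" is the share of tokens on which the feature
    is active. The dashboard may use a different definition.
    """
    values = list(activations)
    if not values:
        raise ValueError("need at least one activation value")
    active = sum(1 for v in values if v > threshold)
    return active / len(values)


# Synthetic data for illustration only: 8 active tokens out of 3000.
toy_activations = [1.5] * 8 + [0.0] * 2992
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

A feature that fires on well under 1% of tokens is sparse in this sense, which is consistent with a feature tied to a narrow topic rather than to common syntax.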