INDEX
Explanations
statements or references related to causing offense
language related to causing offense or being offended
Negative Logits
Monitor   -0.77
Phase     -0.70
omed      -0.69
Helm      -0.68
Canaver   -0.68
packed    -0.66
Maid      -0.65
Mech      -0.65
Phase     -0.65
Motors    -0.65
Positive Logits
offending   3.38
offended    3.03
offend      2.94
insulted    1.87
angered     1.49
infring     1.46
offence     1.45
insult      1.39
blasp       1.36
insults     1.31
Activations Density: 0.026%