INDEX
Explanations
offensive language
derogatory terms directed at individuals in discussions about social issues
Negative Logits
Token         Logit
ļéĨĴ          -0.87
»Ĵ            -0.77
tions         -0.73
streamlined   -0.67
exting        -0.66
srf           -0.65
ActionCode    -0.63
Reviewer      -0.63
ă             -0.61
ü             -0.61
POSITIVE LOGITS
Token             Logit
congratulations   0.73
sorry             0.71
surely            0.67
kidding           0.67
damned            0.67
sorry             0.66
Wr                0.66
Well              0.65
Wrong             0.64
liar              0.62
Activations Density: 0.326%