INDEX
Explanations
phrases related to controversial or offensive language or actions
Negative Logits
Token            Logit
frames           -0.76
hower            -0.74
negie            -0.73
runner           -0.67
olin             -0.67
Stability        -0.66
illon            -0.65
stabilization    -0.64
aea              -0.63
itness           -0.63
Positive Logits
Token            Logit
slurs            1.38
insults          1.10
slur             1.06
insulted         1.05
insulting        1.05
jokes            1.03
remarks          1.03
insult           1.02
caricature       1.02
homophobic       1.01
Activations Density: 0.304%