Explanations
harmful stereotypes and violent thoughts
NEGATIVE LOGITS
excessive       0.47
improper        0.43
혹은            0.41
有害            0.41
publiés         0.40
⛔              0.40
बुरा             0.40
inadmissible    0.40
unsuitable      0.39
悪い            0.39
POSITIVE LOGITS
stereotypes     0.90
stereotype      0.79
Stere           0.63
stereotyp       0.59
stere           0.59
стере           0.57
stereotypical   0.54
Stere           0.54
tropes          0.53
stere           0.53
Activations Density 0.033%