Explanations
harmful, unethical, racist, sexist, toxic, dangerous, or illegal
Negative Logits
entrambe    0.46
entrambi    0.46
oba         0.44
beide       0.42
xticks      0.41
لعاب        0.40
ambos       0.40
দীর         0.40
ridges      0.39
obu         0.38
Positive Logits
All         0.40
அனைத்தும்    0.39
All         0.38
എന്നിവ       0.38
Все         0.37
Sense       0.37
→           0.36
ALL         0.36
→           0.36
НЕ          0.36
Activations Density: 0.039%