INDEX
Explanations
lists of strengths and weaknesses
Negative Logits
injustices    0.20
denunci       0.20
prostit       0.19
renunci       0.19
injustice     0.18
wrongdoing    0.18
immoral       0.18
अन्याय        0.18
minorities    0.18
societal      0.18
Positive Logits
{             0.18
I             0.17
compatible    0.16
y             0.16
6             0.16
am            0.16
s             0.16
et            0.16
}\            0.16
which         0.16
Activations Density 0.000%