INDEX
Explanations
phrases that bring attention to social and political issues
Negative Logits
Token        Logit
--+          -0.78
fork         -0.77
ée           -0.72
Iterator     -0.69
NING         -0.69
beam         -0.69
tails        -0.67
andowski     -0.66
ulla         -0.66
pour         -0.66
Positive Logits
Token          Logit
injust          1.02
dangers         0.92
shortcomings    0.88
misogyny        0.88
atrocities      0.87
wrongdoing      0.87
homosexuality   0.87
abuses          0.87
issues          0.86
sexism          0.83
Activations Density: 0.119%