INDEX
Explanations
social issues related to inequality and gender biases
Negative Logits
hyde       -0.73
eters      -0.72
oths       -0.70
ategory    -0.69
ĸļ         -0.67
ptin       -0.66
aryn       -0.66
eteria     -0.63
Canaver    -0.62
leted      -0.62
Positive Logits
enough      1.22
bye         1.15
luck        1.02
luck        1.01
intentions  0.97
reads       0.94
enough      0.89
nat         0.87
Samar       0.86
sell        0.86
Activations Density: 3.623%