INDEX
Explanations
expressions of surprise or disbelief
Negative Logits
segreg       -0.67
elim         -0.66
Lesbian      -0.63
Personality  -0.62
Luther       -0.61
Feld         -0.61
Liberia      -0.60
Townsend     -0.60
Spa          -0.60
Pixie        -0.58
Positive Logits
esome        1.57
akening      1.28
kward        1.16
alls         1.03
ards         1.00
iring        0.95
ake          0.92
orks         0.91
aw           0.91
reck         0.90
Activations Density: 0.009%