INDEX
Explanations
phrases related to specific professions or characteristics describing people
words that denote various social roles and identities
Negative Logits
tails     -0.74
izens     -0.71
regions   -0.70
Us        -0.70
rams      -0.69
Lans      -0.68
ouls      -0.67
slopes    -0.67
olas      -0.66
bins      -0.66
Positive Logits
digy      0.79
unto      0.76
iste      0.75
ess       0.74
chuk      0.72
herself   0.72
smith     0.71
alyst     0.70
myself    0.70
nik       0.69
Activations Density: 0.278%