Explanations
expressions related to demographic categories such as age, income, race, gender, and background
Negative Logits
tein      -0.80
rence     -0.77
gotten    -0.71
ãĥį       -0.67
ãģį       -0.66
owicz     -0.66
prison    -0.64
enko      -0.64
Tro       -0.63
ש         -0.63
Positive Logits
imaginable            1.17
ranging               1.04
etting                1.04
vying                 0.97
guiActiveUnfocused    0.89
depending             0.88
simultaneously        0.87
paces                 0.85
differing             0.82
cale                  0.82
Activations Density: 0.304%