INDEX
Explanations
references to specific groups of people or demographic categories
NEGATIVE LOGITS
aneously      -0.76
stals         -0.71
ctors         -0.66
izens         -0.64
unts          -0.64
ints          -0.63
..........    -0.63
stein         -0.63
neys          -0.62
Galile        -0.60
POSITIVE LOGITS
heet          1.37
hip           1.21
hare          1.16
mith          1.16
cape          1.07
cale          1.06
ilver         1.04
pring         1.04
hift          1.03
pace          1.01
Activations Density: 0.119%