Explanations
references to groups of people or collective actions
references to collective experiences or generalizations about groups of people
Negative Logits
aic        -0.87
ean        -0.81
eda        -0.75
antic      -0.74
ularity    -0.73
ea         -0.72
rd         -0.69
eus        -0.67
hent       -0.67
effect     -0.65
Positive Logits
else       1.13
selves     0.80
bags       0.79
THING      0.77
WAYS       0.76
wanna      0.75
nodd       0.73
````       0.73
gotta      0.72
bage       0.72
Activations Density 0.040%