INDEX
Explanations
mentions of different groups of people in various contexts, such as researchers, voters, consumers, individuals, players, Jews, and liberals
references to groups of people or individuals in various contexts
Negative Logits
Token    Logit
stars    -0.64
ces      -0.61
Shore    -0.61
ILE      -0.61
aughs    -0.60
Aw       -0.60
UV       -0.59
DOWN     -0.58
sie      -0.58
shows    -0.58
Positive Logits
Token         Logit
are           1.10
aren          1.03
perceive      0.98
prefer        0.97
have          0.97
were          0.95
everywhere    0.94
crave         0.93
realize       0.91
weren         0.91
Activations Density: 0.294%