INDEX
Explanations
phrases related to people and social interactions
references to various groups of people, often in negative or stereotypical contexts
Negative Logits
Pyr          -0.52
Cur          -0.49
IER          -0.48
Byr          -0.48
Prim         -0.47
Grind        -0.47
igor         -0.47
incumbent    -0.47
Vulcan       -0.47
Ranger       -0.46
Positive Logits
rejoice      0.85
unite        0.76
hate         0.75
beware       0.74
sue          0.73
adore        0.73
prefer       0.71
disapprove   0.70
dont         0.70
paces        0.69
Activations Density 0.204%