INDEX
Explanations
words related to social interactions or behaviors
terms related to social behavior
Negative Logits
nces     -0.84
atche    -0.79
xual     -0.75
ilts     -0.74
gran     -0.72
1001     -0.71
ras      -0.70
gger     -0.70
shall    -0.70
oning    -0.69
Positive Logits
norms           0.96
interaction     0.95
interactions    0.95
cues            0.89
ized            0.84
relations       0.82
gatherings      0.81
izing           0.78
istic           0.77
affili          0.77
Activations Density: 0.025%