INDEX
Explanations
words related to attitudes and opinions
references to social attitudes and perceptions
Negative Logits
ded         -0.79
gran        -0.74
amen        -0.73
avez        -0.71
addafi      -0.70
aman        -0.67
cuts        -0.66
Jub         -0.64
Delivery    -0.64
MER         -0.63
Positive Logits
attitudes      1.07
guiActiveUn    0.89
pring          0.86
ocial          0.84
terness        0.82
toward         0.79
insula         0.76
towards        0.76
hovah          0.76
yip            0.75
Activations Density 0.013%