INDEX
Explanations
references to emotions or reactions within social contexts
Head Attr Weights
0:0.02
1:0.02
2:0.08
3:0.06
4:0.21
5:0.03
6:0.18
7:0.12
8:0.03
9:0.07
10:0.06
11:0.07
Negative Logits
flashlight      -1.24
ALWAYS          -1.19
masturb         -1.19
Shift           -1.14
foreskin        -1.13
Cola            -1.12
istries         -1.12
guided          -1.11
recip           -1.11
successfully    -1.10
Positive Logits
Horowitz        1.39
Colomb          1.36
Santos          1.36
Byrne           1.26
Brock           1.25
Vale            1.23
igans           1.21
Regina          1.21
ablishment      1.19
Kop             1.19
Activations Density 0.001%