INDEX
Explanations
emotional responses, particularly feelings of disappointment, anger, shock, and happiness
Negative Logits
eds          -0.16
416          -0.15
ader         -0.15
oss          -0.14
lor          -0.14
rop          -0.14
onth         -0.14
andalone     -0.14
θα           -0.14
berman       -0.13
Positive Logits
about        0.20
withObject   0.17
contres      0.16
ingly        0.15
hearing      0.15
åIJ¬åΰ       0.15
ajar         0.15
isque        0.15
/dist        0.15
bahwa        0.15
Activations Density 0.151%