Explanations
references to anger and related emotions
Negative Logits
Token     Logit
ksen      -0.16
itive     -0.16
atives    -0.15
etto      -0.15
podob     -0.15
ardon     -0.14
tracer    -0.14
Garn      -0.14
etten     -0.14
raya      -0.14
Positive Logits
Token     Logit
Ang       0.27
els       0.22
Ang       0.22
gota      0.21
лий       0.20
ELS       0.20
los       0.20
eline     0.19
(ang      0.19
ang       0.19
Activations Density: 0.014%