INDEX
Explanations
phrases indicating approval or acknowledgment
Head Attr Weights
0:0.02
1:0.06
2:0.13
3:0.04
4:0.02
5:0.03
6:0.05
7:0.11
8:0.32
9:0.04
10:0.06
11:0.07
Negative Logits
ials        -1.22
arrow       -1.17
zie         -1.16
yond        -1.16
arters      -1.14
rongh       -1.13
increments  -1.11
dding       -1.10
nings       -1.10
gaard       -1.10
Positive Logits
ently         1.28
pires         1.22
belonged      1.17
ITED          1.16
edly          1.10
Offic         1.09
loudly        1.08
pired         1.06
ּ             1.05
passionately  1.03
Activations Density 0.110%