INDEX
Explanations
phrases indicating commitment or progress
Head Attr Weights
0:0.07
1:0.04
2:0.11
3:0.14
4:0.02
5:0.05
6:0.02
7:0.10
8:0.03
9:0.02
10:0.33
11:0.01
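The head attribution weights above are given as `head:weight` pairs. As a minimal sketch of how one might read them, the snippet below parses the pairs and reports which head carries the largest weight. The function name `parse_head_weights` and the string `RAW` are hypothetical and only for illustration; the numbers are copied from the listing, and nothing here is claimed about how the dashboard itself computes them.

```python
# Head attribution weights copied from the listing above, as "head:weight" pairs.
# RAW and parse_head_weights are illustrative names, not part of any dashboard API.
RAW = (
    "0:0.07 1:0.04 2:0.11 3:0.14 4:0.02 5:0.05 "
    "6:0.02 7:0.10 8:0.03 9:0.02 10:0.33 11:0.01"
)


def parse_head_weights(raw: str) -> dict[int, float]:
    """Turn whitespace-separated 'head:weight' pairs into {head: weight}."""
    weights = {}
    for item in raw.split():
        head, weight = item.split(":")
        weights[int(head)] = float(weight)
    return weights


weights = parse_head_weights(RAW)

# Head with the largest attribution weight.
top_head = max(weights, key=weights.get)

# Sum of the listed (rounded) weights; it need not be exactly 1.
total = sum(weights.values())

print(f"top head: {top_head}, weight {weights[top_head]:.2f}, listed total {total:.2f}")
```

Read this way, head 10 stands out with a weight of 0.33, more than twice that of the next largest head (head 3 at 0.14).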
Negative Logits
versus         -2.03
Arg            -2.03
or             -2.02
Differences    -2.01
?,             -2.00
offending      -1.99
Generic        -1.98
Common         -1.92
odan           -1.86
dich           -1.85
Positive Logits
congratulations    3.00
improved           2.87
successfully       2.81
benefited          2.80
Congratulations    2.73
ready              2.72
enjoying           2.71
thank              2.69
enjoys             2.68
hoped              2.67
Activations Density 0.435%