INDEX
Explanations
intensifiers or modifiers that express exaggeration or extremes
Head Attr Weights
0:0.06
1:0.03
2:0.05
3:0.16
4:0.03
5:0.05
6:0.02
7:0.06
8:0.03
9:0.02
10:0.42
11:0.03
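As a quick check on the head attribution weights listed above, the following minimal sketch parses the `head:weight` entries and reports the dominant head. The function name `parse_head_weights` and the variable names are hypothetical, introduced only for illustration. The note that the weights sum to slightly under 1 assumes the displayed values are rounded to two decimals.

```python
# Head attribution weights copied from the list above, one "head:weight" pair per entry.
RAW_WEIGHTS = """
0:0.06 1:0.03 2:0.05 3:0.16 4:0.03 5:0.05
6:0.02 7:0.06 8:0.03 9:0.02 10:0.42 11:0.03
"""


def parse_head_weights(text):
    """Parse whitespace-separated 'head:weight' pairs into a dict {head: weight}."""
    weights = {}
    for item in text.split():
        head, weight = item.split(":")
        weights[int(head)] = float(weight)
    return weights


weights = parse_head_weights(RAW_WEIGHTS)

# Head with the largest attribution weight.
top_head = max(weights, key=weights.get)

# Sum of the displayed weights (likely below 1.0 because each value is rounded).
total = sum(weights.values())

print(f"top head: {top_head} (weight {weights[top_head]:.2f})")
print(f"sum of displayed weights: {total:.2f}")
```

On the values listed here, head 10 carries the largest share (0.42), followed by head 3 (0.16); the remaining ten heads each contribute 0.06 or less.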
Negative Logits
intention: -2.37
neutral: -2.03
intentions: -2.02
nor: -2.00
hope: -1.93
unaffected: -1.93
lication: -1.92
endeav: -1.90
anticip: -1.88
neutral: -1.87
Positive Logits
dstg: 2.74
quicker: 2.40
eeper: 2.37
faster: 2.37
hotter: 2.17
MUCH: 2.10
tighter: 2.10
worse: 2.07
clearer: 2.04
yo: 2.03
Activations Density 0.018%