INDEX
Explanations
phrases indicating ease or difficulty in performing actions
Head Attr Weights
0:0.10
1:0.01
2:0.10
3:0.12
4:0.31
5:0.05
6:0.02
7:0.01
8:0.04
9:0.12
10:0.04
11:0.02
Negative Logits
Stars      -1.28
gmail      -1.26
cliff      -1.26
leased     -1.20
hya        -1.15
ached      -1.15
bon        -1.12
angel      -1.12
afety      -1.12
variance   -1.11
Positive Logits
than         1.65
Catalog      1.41
ildo         1.37
JUST         1.29
lug          1.27
Luigi        1.24
than         1.22
Snape        1.20
aline        1.20
navigating   1.19
Activations Density 0.014%