INDEX
Explanations
phrases related to choices and voluntary actions
Head Attr Weights
0:0.04
1:0.01
2:0.07
3:0.09
4:0.28
5:0.03
6:0.05
7:0.14
8:0.04
9:0.05
10:0.09
11:0.05
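A minimal sketch of how the head attribution weights above might be read, assuming each entry is a `head index:weight` pair. The listing does not say how the weights are normalized, so the sum is computed rather than assumed to be 1. The variable names are illustrative and are not taken from any tool.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_attr = {
    0: 0.04, 1: 0.01, 2: 0.07, 3: 0.09, 4: 0.28, 5: 0.03,
    6: 0.05, 7: 0.14, 8: 0.04, 9: 0.05, 10: 0.09, 11: 0.05,
}

# Rank heads from largest to smallest weight.
ranked = sorted(head_attr.items(), key=lambda kv: kv[1], reverse=True)
top_head, top_weight = ranked[0]

# The listed weights need not sum to 1, so compute the total explicitly.
total = sum(head_attr.values())

print(f"top head: {top_head} (weight {top_weight:.2f})")
print(f"sum of listed weights: {total:.2f}")
print(f"top head share of listed weight: {top_weight / total:.1%}")
```

Read this way, head 4 carries the largest weight, followed by head 7.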
Negative Logits
gered        -1.80
uncture      -1.69
ggles        -1.67
ensable      -1.64
ggle         -1.60
onent        -1.60
ankind       -1.60
iferation    -1.59
functional   -1.57
ewitness     -1.56
Positive Logits
simplicity   1.72
mild         1.58
minimalist   1.55
scraps       1.53
龍�           1.51
sunset       1.51
underdog     1.45
caveat       1.43
Mk           1.40
quieter      1.38
Activations Density 0.002%