INDEX
Explanations
phrases related to making choices or decisions
Head Attr Weights
0:0.02
1:0.01
2:0.16
3:0.28
4:0.12
5:0.02
6:0.04
7:0.10
8:0.05
9:0.03
10:0.05
11:0.05
Negative Logits
[undecodable token]   -1.64
Nanto                 -1.62
handwriting           -1.46
azon                  -1.46
Rath                  -1.44
underestimated        -1.35
lett                  -1.34
ortment               -1.34
00007                 -1.33
laugh                 -1.32
Positive Logits
anymore               1.86
>)                    1.72
tarians               1.61
ught                  1.60
anke                  1.59
acly                  1.58
specifics             1.57
ocalypse              1.56
schild                1.55
ependence             1.53
Activations Density 0.028%