INDEX
Explanations
indications of challenges or difficulties in various contexts
Head Attr Weights
0:0.04
1:0.04
2:0.07
3:0.16
4:0.06
5:0.07
6:0.05
7:0.09
8:0.03
9:0.04
10:0.14
11:0.16
Negative Logits
"},{"
-3.02
"></
-2.96
</
-2.81
】
-2.71
)</
-2.66
\">
-2.53
�
-2.44
},{"
-2.37
">
-2.35
›
-2.32
Positive Logits
bably
2.40
whiff
2.21
kidding
2.21
quirks
1.94
oops
1.93
apiece
1.90
Canaver
1.86
metic
1.86
yawn
1.84
yip
1.83
Activations Density 0.001%