INDEX
Explanations
phrases that convey summaries or conclusions
Head Attr Weights (head index: attribution weight)
0:0.02
1:0.01
2:0.12
3:0.09
4:0.11
5:0.03
6:0.04
7:0.27
8:0.03
9:0.04
10:0.08
11:0.10
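This sketch ranks the heads by the weights listed above. Head 7 carries the largest share (0.27). The code treats each value only as that head's attribution weight for this feature. The page does not say how the dashboard computes those weights, so the sketch assumes nothing beyond ranking the displayed numbers.

```python
# Head attribution weights as listed on this page (head index -> weight).
HEAD_ATTR_WEIGHTS = {
    0: 0.02, 1: 0.01, 2: 0.12, 3: 0.09, 4: 0.11, 5: 0.03,
    6: 0.04, 7: 0.27, 8: 0.03, 9: 0.04, 10: 0.08, 11: 0.10,
}


def rank_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights, largest first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    for head, weight in rank_heads(HEAD_ATTR_WEIGHTS):
        print(f"head {head}: {weight:.2f}")
    # The displayed values are rounded to two decimals, so their total need not be exactly 1.
    print(f"sum of listed weights: {sum(HEAD_ATTR_WEIGHTS.values()):.2f}")
```

Running it prints heads 7, 2 and 4 as the top three. It also shows that the listed weights sum to 0.94.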
Negative Logits
jab            -1.68
emies          -1.54
ariat          -1.52
iaz            -1.48
thouse         -1.47
cest           -1.46
rex            -1.44
culosis        -1.38
stadt          -1.37
urn            -1.37
Positive Logits
exquisite      1.34
simple         1.33
loosely        1.28
differently    1.23
circumst       1.23
purely         1.21
impe           1.21
neatly         1.21
unamb          1.20
beautifully    1.20
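Lists like these are commonly produced by projecting a feature's output direction through the model's unembedding matrix. The tokens with the largest and smallest resulting values are then kept. The sketch below assumes that procedure. The direction, unembedding matrix, and four-token vocabulary are toy, hypothetical values and are not taken from this model.

```python
def logit_effects(direction, unembed):
    """Dot the feature direction with each vocabulary column of the unembedding.

    direction: list of d_model floats (hypothetical feature output direction).
    unembed: list of d_model rows, each a list of vocab_size floats (W_U).
    Returns one logit contribution per vocabulary token.
    """
    vocab_size = len(unembed[0])
    return [
        sum(direction[i] * unembed[i][j] for i in range(len(direction)))
        for j in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k):
    """Return the k most positive and the k most negative (token, logit) pairs."""
    pairs = list(zip(vocab, effects))
    positive = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    negative = sorted(pairs, key=lambda p: p[1])[:k]
    return positive, negative


if __name__ == "__main__":
    # Toy, hypothetical inputs: d_model = 2, vocab_size = 4.
    vocab = ["simple", "neatly", "jab", "rex"]
    direction = [1.0, -0.5]
    unembed = [
        [1.0, 0.5, -1.0, 0.0],
        [0.0, -1.0, 1.0, 0.5],
    ]
    effects = logit_effects(direction, unembed)
    positive, negative = top_and_bottom(effects, vocab, k=2)
    print("positive:", positive)
    print("negative:", negative)
```

With these toy inputs, "simple" and "neatly" come out on the positive side (1.0 each). "jab" (-1.5) and "rex" (-0.25) come out on the negative side. A real dashboard would run the same ranking over the full vocabulary.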
Activation Density: 0.011%
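Activation density is typically the share of sampled tokens on which the feature fires. This sketch assumes that definition and counts only activations strictly above zero. Under it, a density of 0.011% means roughly 11 activating tokens per 100,000, or about one in 9,000. The function name and threshold parameter are illustrative, not part of any dashboard API.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # Hypothetical sample: 11 active tokens out of 100,000.
    sample = [0.0] * 99_989 + [0.5] * 11
    print(f"{activation_density(sample):.3f}%")
```

On the hypothetical sample this prints `0.011%`, matching the density shown above.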