INDEX
Explanations
structural elements or markers in the text, such as brackets or punctuation
Head Attr Weights
0:0.09
1:0.23
2:0.08
3:0.09
4:0.06
5:0.03
6:0.10
7:0.06
8:0.03
9:0.04
10:0.05
11:0.08
Negative Logits
Si: -3.14
Mour: -2.96
Machines: -2.77
ô: -2.62
Reviewer: -2.60
Johnson: -2.56
separ: -2.52
Popular: -2.52
cho: -2.49
JO: -2.48
Positive Logits
baseline: 5.72
Unch: 3.80
base: 3.56
elines: 3.43
normalized: 3.42
peanuts: 3.08
STAND: 2.94
bas: 2.92
ummies: 2.85
base: 2.82
Activations Density 0.000%