INDEX
Explanations
references to data organization and episode structure
Head Attr Weights
0:0.04
1:0.02
2:0.04
3:0.08
4:0.08
5:0.08
6:0.03
7:0.36
8:0.06
9:0.02
10:0.06
11:0.06
Negative Logits
witz      -2.90
livious   -2.69
oly       -2.49
thouse    -2.45
aiman     -2.38
abama     -2.35
vez       -2.32
STATE     -2.31
adia      -2.30
truth     -2.27
Positive Logits
])             2.25
chronological  2.25
numbered       2.22
Nun            2.18
Orient         2.16
Era            2.12
ply            2.12
Organization   2.11
Nas            2.10
Played         2.10
Activations Density 0.001%