INDEX
Explanations
phrases indicating decisive actions and outcomes
Head Attr Weights
0:0.02
1:0.02
2:0.09
3:0.14
4:0.02
5:0.04
6:0.05
7:0.11
8:0.07
9:0.19
10:0.06
11:0.15
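
The weights above can be ranked to see which heads contribute most. The following is a minimal sketch, assuming each entry is a `head:weight` pair as listed; the names `head_weights` and `top_heads` are hypothetical and not part of the dashboard or any tool's API.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_weights = {
    0: 0.02, 1: 0.02, 2: 0.09, 3: 0.14, 4: 0.02, 5: 0.04,
    6: 0.05, 7: 0.11, 8: 0.07, 9: 0.19, 10: 0.06, 11: 0.15,
}


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    # The displayed values are rounded, so the total need not be exactly 1.
    total = sum(head_weights.values())
    for head, weight in top_heads(head_weights):
        print(f"head {head}: weight {weight:.2f} ({weight / total:.0%} of listed total)")
```

With the values listed, heads 9, 11, and 3 rank highest, and the displayed weights sum to roughly 0.96, presumably because of rounding.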
Negative Logits
Smith    -1.10
YR       -1.09
LER      -1.08
olini    -1.05
Jindal   -1.03
ーク     -1.02
ーティ   -0.98
gets     -0.97
arse     -0.96
Daily    -0.95
Positive Logits
sealing   1.44
seal      1.42
rity      1.28
antha     1.25
envelop   1.24
keye      1.14
uter      1.13
sealed    1.12
borders   1.11
secrecy   1.10
Activations Density: 0.005%