INDEX
Explanations
pronouns and articles related to existential conditions or perceptions
Head Attr Weights
0:0.02
1:0.02
2:0.08
3:0.09
4:0.20
5:0.03
6:0.27
7:0.06
8:0.04
9:0.03
10:0.05
11:0.05
Negative Logits
footprints  -1.54
ּ  -1.30
uniforms  -1.25
Tam  -1.21
DV  -1.21
Ys  -1.20
playbook  -1.18
levers  -1.16
Orig  -1.15
rosters  -1.14
Positive Logits
theless  1.74
ouver  1.62
iful  1.56
ividual  1.55
icultural  1.54
anwhile  1.49
guiName  1.44
seless  1.43
amaz  1.43
entious  1.38
Activations Density 0.011%