INDEX
Explanations
words related to identity
Head Attr Weights
0:0.02
1:0.01
2:0.04
3:0.06
4:0.04
5:0.03
6:0.47
7:0.05
8:0.05
9:0.06
10:0.06
11:0.04
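A minimal sketch for reading the head attribution weights above: it parses the listed `head:weight` pairs and reports which head carries the largest weight. The variable names (`raw`, `weights`, `top_head`) are illustrative assumptions and not part of the dashboard, which does not state how the weights are normalized.

```python
# Head attribution weights copied from the listing above (head index:weight).
# Assumption: each entry is "<head index>:<weight>", twelve heads numbered 0-11.
raw = "0:0.02 1:0.01 2:0.04 3:0.06 4:0.04 5:0.03 6:0.47 7:0.05 8:0.05 9:0.06 10:0.06 11:0.04"

weights = {}
for pair in raw.split():
    head, value = pair.split(":")
    weights[int(head)] = float(value)

# Head with the largest attribution weight, and its share of the listed total.
top_head = max(weights, key=weights.get)
top_weight = weights[top_head]
total = sum(weights.values())

print(f"top head: {top_head}, weight: {top_weight:.2f}, "
      f"share of listed total: {top_weight / total:.2f}")
```

The displayed values are rounded to two decimals, so the listed total need not come to exactly 1.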
Negative Logits
ngth:-1.60
BIL:-1.41
selves:-1.32
loads:-1.29
irlf:-1.26
drm:-1.26
bread:-1.23
letal:-1.22
Union:-1.22
loads:-1.20
Positive Logits
Stra:1.48
bush:1.31
aroo:1.31
umbered:1.28
Blanc:1.27
hatch:1.23
Bris:1.23
Gau:1.22
combe:1.22
uzz:1.21
Activations Density 0.001%