INDEX
Explanations
references to purity or pure concepts
Negative Logits
shelf      -0.19
perature   -0.16
uality     -0.15
itr        -0.15
reuse      -0.15
atical     -0.15
sel        -0.15
chy        -0.15
mun        -0.14
ÏĥÏĩ       -0.14
Positive Logits
bred       0.34
pure       0.30
Pure       0.28
pure       0.26
Pure       0.26
foy        0.25
st         0.24
PURE       0.23
ç²         0.23
eing       0.23
Activations Density: 0.015%