INDEX
Explanations
themes related to introspection and self-reflection
Negative Logits
Token    Logit
θμ       -0.19
iel      -0.18
ëĥ¥      -0.15
cko      -0.15
RGBA     -0.15
inka     -0.15
aters    -0.14
kem      -0.14
kud      -0.14
align    -0.14
Positive Logits
Token       Logit
inward      0.32
wards       0.29
Internal    0.26
internal    0.26
Inside      0.26
åħ§         0.25
inside      0.25
åĨħ         0.25
Inside      0.25
outward     0.24
Activations Density: 0.078%
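
Some of the tokens listed above (for example `åĨħ`, `åħ§`, and `ëĥ¥`) look like raw byte-level BPE vocabulary strings rather than readable text. The sketch below assumes the tokenizer uses the GPT-2 style byte-to-unicode table, which this page does not confirm, and shows how such strings could be turned back into UTF-8 text. The helper names `bytes_to_unicode` and `decode_token` are illustrative only. Tokens containing characters outside that table (such as `θμ`) are returned unchanged, on the assumption that they are already decoded.

```python
def bytes_to_unicode():
    """Build the GPT-2 style map from byte values to printable characters.

    Assumption: the model behind this dashboard uses this table.
    """
    # Bytes that already have a printable, non-space glyph map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are shifted to code points 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Decode one vocabulary string to text, or return it unchanged."""
    if any(ch not in UNICODE_TO_BYTE for ch in token):
        # Not a byte-level string under this table; assume already readable.
        return token
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The token may hold only part of a multi-byte character.
        return token


if __name__ == "__main__":
    for tok in ["åħ§", "åĨħ", "ëĥ¥", "θμ", "inward"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, `åĨħ` and `åħ§` decode to the simplified and traditional forms of the Chinese character for "inside", which is consistent with the other positive-logit tokens (`inward`, `internal`, `inside`).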