INDEX
Explanations
concepts related to self-awareness and knowledge
Negative Logits
inval   -0.16
ære     -0.15
azu     -0.15
aub     -0.15
izr     -0.14
orce    -0.14
utura   -0.14
fir     -0.14
wear    -0.14
withd   -0.14
Positive Logits
about          0.20
_about         0.16
aton           0.15
Thur           0.15
understanding  0.15
aman           0.14
eye            0.14
pire           0.14
ollen          0.14
is             0.14
Activations Density 0.202%