INDEX
Explanations
references to personal identity and self-perception
Negative Logits
antu      -0.17
ogne      -0.15
rang      -0.15
sdale     -0.14
ibur      -0.14
ascular   -0.14
thro      -0.13
ovÄĽ      -0.13
otty      -0.13
_("       -0.13
Positive Logits
ãĢħ        0.16
rens       0.15
gni        0.15
leta       0.15
áºŃt       0.15
-centered  0.14
sert       0.14
rosse      0.14
oji        0.14
antry      0.14
Activations Density 0.169%