INDEX
Explanations
expressions related to personal identity and self-reflection
Negative Logits
loat            -0.15
icky            -0.14
leans           -0.14
ovice           -0.14
recommendation  -0.14
evil            -0.13
apos            -0.13
ãģ¨ãģĨ          -0.13
ırak            -0.13
odyn            -0.13
Positive Logits
being           0.23
reality         0.21
actions         0.21
existence       0.20
Being           0.20
own             0.20
humanity        0.19
environment     0.19
worth           0.18
past            0.18
Activations Density 0.234%