INDEX
Explanations
phrases related to introspection and self-reflection
Negative Logits
icit     -0.17
こそ     -0.16
ören     -0.15
SCORE    -0.14
аж       -0.14
lein     -0.14
उत       -0.14
adora    -0.14
ick      -0.14
ーマ     -0.14
Positive Logits
nels         0.15
izza         0.15
ระ           0.15
reflect      0.14
reflect      0.14
aires        0.14
reflection   0.14
emb          0.13
opaque       0.13
Eins         0.13
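The page does not say how these logit lists are produced. A common approach in feature dashboards is to project the feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting value. The sketch below assumes that approach. The names `logit_effects` and `top_and_bottom` and all weights are hypothetical toy values, not taken from any real model.

```python
def logit_effects(decoder_dir, unembed_rows, vocab):
    """Dot the (assumed) decoder direction with each token's unembedding row."""
    return {
        tok: sum(d * w for d, w in zip(decoder_dir, row))
        for tok, row in zip(vocab, unembed_rows)
    }


def top_and_bottom(effects, k):
    """Return the k most negative and k most positive (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return negative, positive


# Hypothetical toy setup: 4 tokens, a 2-dimensional residual stream.
vocab = ["reflect", "opaque", "icit", "ick"]
unembed_rows = [
    [0.14, 0.5],
    [0.13, -0.2],
    [-0.17, 0.1],
    [-0.14, 0.3],
]
decoder_dir = [1.0, 0.0]  # made-up direction for illustration only

effects = logit_effects(decoder_dir, unembed_rows, vocab)
negative, positive = top_and_bottom(effects, k=2)
print("negative:", negative)
print("positive:", positive)
```

With a real model, the same ranking would run over the full vocabulary, which is why unrelated subword fragments can appear next to on-topic tokens such as "reflect" when all effects are small.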
Activations Density 0.021%
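The page does not define activation density. The sketch below assumes it means the fraction of token positions on which the feature's activation is above zero. The function name `activation_density` and the sample data are hypothetical, chosen only to reproduce a value of the same magnitude.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 21 of them active.
acts = [0.0] * 100_000
for i in range(21):
    acts[i * 1000] = 1.0

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.021% corresponds to the feature firing on roughly 2 of every 10,000 tokens.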