INDEX
Explanations
phrases indicating self-reflection or self-awareness
Negative Logits
Token      Logit
toy        -0.16
bild       -0.16
hal        -0.15
ë§ī        -0.15
lm         -0.15
Dahl       -0.15
illos      -0.15
content    -0.15
illas      -0.14
content    -0.14
Positive Logits
Token      Logit
erdale     0.18
762        0.15
assen      0.15
gie        0.15
362        0.15
-même      0.14
ipsis      0.14
ker        0.14
76         0.14
nier       0.14
Activations Density: 0.053%