INDEX
Explanations
instances of self-reflection and expressions of doubt or criticism
Negative Logits
Impossible    -0.17
riere         -0.16
rière         -0.16
lech          -0.15
gili          -0.15
YLON          -0.15
/apis         -0.14
Impossible    -0.14
uye           -0.14
krom          -0.14
Positive Logits
na            0.35
mistake       0.32
naive         0.30
naï           0.28
foolish       0.27
mistakes      0.27
folly         0.26
错误          0.25
Na            0.25
mistaken      0.24
Activations Density 0.037%