Explanations
instances of self-correction or admission of mistakes in communication
Negative Logits
Token     Logit
obs       -0.15
ÙģÙĨ      -0.14
Whe       -0.14
mand      -0.14
ế         -0.14
acon      -0.14
Ãĸn       -0.14
odge      -0.14
mj        -0.14
æİ§       -0.13
Positive Logits
Token       Logit
æĺ¯æĪij      0.17
bine        0.17
meant       0.17
earlier     0.16
åĪļæīį      0.15
(æ°´        0.15
previous    0.15
previously  0.14
OPS         0.14
oversight   0.14
Activations Density 0.223%