INDEX
Explanations
sentences indicating understanding or realization
Negative Logits
ばかり   -0.15
IRROR    -0.15
eni      -0.15
急       -0.15
asz      -0.15
aná      -0.14
ensis    -0.14
ature    -0.14
former   -0.13
atk      -0.13
Positive Logits
exactly       0.20
instantly     0.19
Exactly       0.19
deep          0.18
Exactly       0.17
immediately   0.16
they          0.16
beyond        0.16
[((           0.16
ape           0.15
Activations Density 0.052%