INDEX
Explanations
phrases indicating realizations or discoveries
Negative Logits
Token       Logit
I           -0.16
seen        -0.15
j           -0.15
seen        -0.14
anon        -0.14
fik         -0.14
initial     -0.14
でしょう    -0.14
oci         -0.14
known       -0.14
Positive Logits
Token       Logit
原来        0.23
actually    0.20
actually    0.20
Actually    0.18
Actually    0.17
indeed      0.17
ičky        0.16
竣          0.16
@",         0.15
agal        0.14
Activations Density 0.238%