INDEX
Explanations
phrases indicating attempts or efforts to solve problems
Negative Logits
Token      Logit
ifix       -0.17
McM        -0.16
exact      -0.16
isten      -0.15
qus        -0.15
lov        -0.15
ddy        -0.15
-of        -0.14
slik       -0.14
ãĥ³ãĤ¯     -0.14
Positive Logits
Token      Logit
arcer      0.15
etz        0.14
YGON       0.14
astore     0.14
åŃ         0.14
нен        0.14
åĽŀ        0.14
cak        0.14
ahl        0.13
conti      0.13
Activations Density: 0.038%