INDEX
Explanations
phrases that express difficulty or inability to accomplish tasks
Negative Logits
Token   Logit
ерк     -0.16
έρ      -0.15
igan    -0.15
pras    -0.14
ubl     -0.14
maz     -0.14
QT      -0.14
стан    -0.14
GO      -0.14
erk     -0.14
Positive Logits
Token      Logit
figure     0.33
seem       0.33
figure     0.28
seems      0.27
figured    0.25
seemed     0.24
-figure    0.24
figures    0.22
figuring   0.21
Seems      0.21
Activations Density 0.079%