INDEX
Explanations
phrases indicating inability or difficulties
Negative Logits
jac      -0.17
Jou      -0.16
ymb      -0.15
iner     -0.15
azel     -0.14
bon      -0.14
uel      -0.14
aller    -0.14
уму      -0.14
erm      -0.14
Positive Logits
harm        0.16
осп         0.16
ůst         0.15
YLE         0.14
adir        0.14
Nationwide  0.14
harming     0.14
elsewhere   0.14
ugin        0.14
Rolled      0.13
Activations Density 0.075%