INDEX
Explanations
phrases indicating disregard or concession
Negative Logits
inge        -0.16
kelig       -0.15
bersome     -0.14
ýš          -0.14
arnation    -0.14
иной        -0.14
åħ»         -0.14
erguson     -0.14
umbing      -0.14
uegos       -0.13
Positive Logits
ots         0.16
Butter      0.15
297         0.15
ÙħÙĪÙĦ      0.14
atte        0.14
uell        0.13
Rim         0.13
latter      0.13
/e          0.13
urst        0.13
Activations Density: 0.024%