INDEX
Explanations
phrases indicating reversal or contrast, particularly when describing situations or states that are opposite to expectations
Negative Logits
wright    -0.16
ugh       -0.15
rieve     -0.15
fell      -0.15
Stokes    -0.14
enn       -0.14
plex      -0.14
Arb       -0.14
Arbor     -0.14
vida      -0.14
Positive Logits
ucha      0.16
Wort      0.15
jde       0.15
ews       0.14
erule     0.14
arehouse  0.14
Niet      0.14
цать      0.13
Rule      0.13
ZW        0.13
Activations Density 0.009%