INDEX
Explanations
phrases indicating a comparison or contrast
Negative Logits
even        -0.18
even        -0.16
EVEN        -0.16
uchen       -0.16
actually    -0.14
oui         -0.14
IDI         -0.14
Lair        -0.14
teri        -0.14
chwitz      -0.13
Positive Logits
929          0.18
ÏĦί          0.16
umas         0.15
eln          0.15
Versions     0.15
until        0.15
ãģªãĤī       0.15
BorderStyle  0.15
лаж          0.14
Güven        0.14
Activations Density 0.027%