INDEX
Explanations
negations and expressions of doubt or uncertainty
Negative Logits
384       -0.16
nty       -0.16
okers     -0.14
antor     -0.14
fairly    -0.14
ubl       -0.14
ran       -0.14
xit       -0.13
adf       -0.13
bj        -0.13
Positive Logits
THAT      0.41
_that     0.33
TOO       0.33
that      0.32
that      0.32
too       0.31
That      0.30
too       0.29
že        0.29
that      0.28
Activations Density 0.117%
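
The density figure can be read as the share of sampled tokens on which the feature fires. The page does not define it, so the sketch below rests on an assumption: density is the fraction of token activations strictly above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of sampled tokens on which
    the feature is active. The source page does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 active tokens out of 2564 (not data from the page).
sample = [0.0] * 2561 + [1.2, 0.4, 3.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.117%
```

Under this reading, a density of 0.117% means the feature is active on roughly one token in 850.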