Explanations
phrases that express doubt or uncertainty
phrases expressing potential difficulties or challenges
Negative Logits
ŃĶ  -0.73
ppo  -0.70
çīĪ  -0.64
ļéĨĴ  -0.64
ulner  -0.64
etheless  -0.64
IRT  -0.63
ĸļ  -0.63
Published  -0.63
ãĤ¸  -0.61
Positive Logits
but  1.05
BUT  0.84
though  0.82
tho  0.82
But  0.81
anymore  0.80
but  0.78
however  0.74
But  0.73
yet  0.73
Activations Density 0.728%