INDEX
Explanations
phrases indicating contrast or disagreement
signals or markers indicating the beginning or end of a text segment
Negative Logits
Token       Logit
代          -0.72
pecially    -0.69
Unit        -0.60
senal       -0.58
ォ          -0.58
renheit     -0.56
omever      -0.55
覚醒        -0.55
ular        -0.55
However     -0.55
Positive Logits
Token           Logit
nonetheless     1.12
nevertheless    0.90
etheless        0.80
still           0.64
scept           0.64
anyway          0.64
beware          0.64
undeniable      0.62
doubts          0.62
caveats         0.60
Activations Density 0.644%