Explanations
negation and expressions that counter claims or expectations
Negative Logits
ybrid      -0.17
neither    -0.16
ถ          -0.16
lef        -0.16
hardly     -0.15
不了       -0.15
no         -0.15
eve        -0.15
Almost     -0.15
affle      -0.15
Positive Logits
altogether  0.24
stint       0.22
alto        0.19
mere        0.17
greatly     0.17
Alto        0.15
merely      0.15
seriously   0.15
strictly    0.15
Overall     0.15
Activations Density 0.577%