INDEX
Explanations
negations and expressions of contradiction
Negative Logits
be      -0.40
Be      -0.32
(be     -0.29
be      -0.29
Be      -0.27
.Be     -0.23
(Be     -0.21
-be     -0.21
/be     -0.20
be      -0.20
Positive Logits
need      0.24
seem      0.21
need      0.19
belong    0.19
NotExist  0.18
deserve   0.18
Need      0.18
tend      0.17
have      0.16
care      0.16
Activations Density 0.214%