INDEX
Explanations
negations and phrases indicating exceptions
Negative Logits
Token         Logit
essen         -0.16
.persistent   -0.15
leton         -0.15
atural        -0.15
dex           -0.15
adu           -0.15
ayd           -0.14
ernel         -0.14
Vak           -0.14
adic          -0.14
Positive Logits
Token         Logit
necessarily    0.28
withstanding   0.21
ori            0.21
vice           0.19
merely         0.18
ché            0.18
just           0.17
just           0.17
ecessarily     0.16
ivor           0.16
Activation Density: 0.040%