INDEX
Explanations
negations and words that imply exclusion or limitation in statements
Negative Logits
oux               -0.16
luv               -0.16
599               -0.15
urd               -0.15
anel              -0.14
ndon              -0.14
imizer            -0.14
ĺ认               -0.14
constitutional    -0.14
ÙĦÙĪ              -0.14
Positive Logits
afraid            0.22
judgment          0.21
necessarily       0.21
judgement         0.20
rein              0.19
conform           0.19
limit             0.18
just              0.18
conventional      0.18
merely            0.18
Activations Density: 0.244%