INDEX
Explanations
words or phrases indicating negation or disproval
negations in statements
Negative Logits
stakes       -0.73
å¥           -0.69
éĥ           -0.68
æ©           -0.67
iers         -0.67
LG           -0.65
Inventory    -0.64
PDATE        -0.64
ãĤ¼          -0.64
WER          -0.62
POSITIVE LOGITS
icably       1.33
icable       1.17
necessarily  1.15
epad         1.10
hin          1.08
exactly      0.99
orious       0.97
eworthy      0.90
quite        0.87
yet          0.86
Activations Density 0.079%
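
Note on the tokens å¥, éĥ, æ© and ãĤ¼ in the negative-logit list: they look like encoding debris, but they are probably byte-level vocabulary entries shown in their printable form. The sketch below assumes the underlying model uses a GPT-2-style byte-level BPE vocabulary, which this page does not state. It rebuilds the standard byte-to-character table and maps each token back to raw bytes. The helper name `token_to_bytes` is hypothetical and introduced only for illustration.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: recover the raw bytes behind a displayed token."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token)


# Assumption: these four strings are byte-level tokens, not scrape damage.
for tok in ["å¥", "éĥ", "æ©", "ãĤ¼"]:
    raw = token_to_bytes(tok)
    print(tok, raw.hex(" "), raw.decode("utf-8", errors="replace"))
```

Under that assumption, the first three tokens are two-byte prefixes of three-byte UTF-8 sequences (e5 a5, e9 83, e6 a9), so they do not decode to a full character on their own. The last one (e3 82 bc) is a complete sequence and decodes to the katakana character ゼ.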