Explanations
phrases and terms indicating negation or warning
Negative Logits
Token         Logit
ViewFeatures  -0.96
enumi         -0.83
dymyr         -0.80
kloped        -0.80
EconPapers    -0.75
клопе         -0.74
writeField    -0.72
enderror      -0.72
หวัด          -0.72
DoubleQuotes  -0.69
Positive Logits
Token   Logit
NOT     1.07
ONLY    1.04
BOTH    1.01
ANY     0.97
MUST    0.95
VERY    0.93
ONLY    0.92
ANYONE  0.91
SAME    0.91
HUGE    0.91
Activations Density 0.096%