INDEX
Explanations
negations
negations or denials in statements
NEGATIVE LOGITS
Reviewer      -0.71
*/(           -0.62
iferation     -0.61
oided         -0.60
ittle         -0.60
omm           -0.58
imentary      -0.57
velt          -0.57
iber          -0.56
fixme         -0.56
POSITIVE LOGITS
etheless      0.87
deter         0.77
necessarily   0.72
entirely      0.69
exactly       0.67
rael          0.66
fooled        0.65
ravings       0.63
altogether    0.63
swayed        0.61
Activations Density: 0.218%