INDEX
Explanations
negative or limiting phrases involving actions and outcomes
Negative Logits
Token           Logit
cens            -0.70
understatement  -0.68
uders           -0.63
undrum          -0.62
opausal         -0.62
obo             -0.62
geries          -0.60
false           -0.60
hi              -0.60
Hate            -0.59
Positive Logits
Token           Logit
necessarily     0.93
etheless        0.86
conclusive      0.81
cially          0.81
guarantee       0.69
infall          0.69
specifics       0.66
exact           0.65
nonetheless     0.63
statistically   0.63
Activations Density: 0.355%