INDEX
Explanations
prepositional phrases that explain causation or contrast
clauses that serve as explanations or justifications
Negative Logits
+++          -0.71
Furious      -0.63
ossier       -0.62
bably        -0.62
nonetheless  -0.61
Mac          -0.60
Bucks        -0.60
sometimes    -0.60
Happ         -0.60
Frenzy       -0.60
Positive Logits
lihood   0.70
sexes    0.69
qt       0.66
hetical  0.66
ance     0.65
Americ   0.65
hemat    0.62
rings    0.62
geries   0.62
oping    0.61
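Lists of negative and positive logits like the two above are often produced by projecting a feature's decoder direction onto the model's unembedding matrix and ranking vocabulary tokens by the result. The sketch below illustrates that ranking on toy numbers. It assumes that this projection is how these particular values were computed, which the page itself does not state, and the function names `logit_effects` and `top_and_bottom` are hypothetical.

```python
# Minimal sketch, ASSUMING logit effect = decoder direction . unembedding column.
# All names here are hypothetical; the data below is a toy example, not real weights.

def logit_effects(decoder_dir, unembed, vocab):
    """Return {token: effect}, where unembed[i][j] is the weight from
    residual-stream dimension i to vocabulary entry j."""
    effects = {}
    for j, token in enumerate(vocab):
        effects[token] = sum(
            decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir))
        )
    return effects


def top_and_bottom(effects, k):
    """Return the k most negative and k most positive (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Toy usage: a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
decoder_dir = [1.0, -0.5]
unembed = [[0.2, -0.4, 0.1, 0.6],
           [0.4, 0.2, -0.2, 0.0]]
negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed, vocab), k=2)
print("negative:", negative)
print("positive:", positive)
```

A real dashboard would run the same ranking over the full vocabulary, typically with a matrix library rather than Python loops.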
Activation Density: 0.138%
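Activation density is usually read as the share of tokens on which the feature fires. The sketch below assumes that definition, with "fires" meaning an activation above a threshold of zero. Both the definition and the hypothetical `activation_density` helper are illustrative assumptions and are not taken from the page.

```python
# Minimal sketch, ASSUMING density = fraction of tokens whose activation exceeds
# a threshold (0.0 by default). The helper name is hypothetical.

def activation_density(activations, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy usage: the feature fires on 2 of 1000 tokens.
acts = [0.0] * 998 + [1.5, 0.3]
density = activation_density(acts)
print(f"{100 * density:.3f}%")  # formatted as a percentage, like the dashboard value
```

Under this definition, a density of 0.138% means the feature is active on roughly 1.4 tokens per thousand.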