INDEX
Explanations
references to actions perceived as negative or harmful to others
expressions of disbelief or shock regarding human actions or events
Negative Logits
Token         Logit
Occasionally  -0.64
periodically  -0.58
shortly       -0.55
endif         -0.54
Newsletter    -0.54
assures       -0.54
explains      -0.53
respectively  -0.52
incumb        -0.51
summarizes    -0.51
Positive Logits
Token         Logit
such          1.30
such          1.12
Such          0.97
anything      0.96
this          0.93
Such          0.93
this          0.92
THIS          0.92
these         0.87
THAT          0.85
Activations Density: 0.564%