INDEX
Explanations
controversial or contrasting statements
references to generalizations and exceptions about different groups or categories of people
Negative Logits
GENERAL         -0.71
Emin            -0.69
Motion          -0.61
oided           -0.60
Seat            -0.60
respectively    -0.60
Conservation    -0.59
NOTICE          -0.58
Consider        -0.58
ATURE           -0.58
Positive Logits
anymore         1.68
nor             1.36
necessarily     0.95
yet             0.85
anything        0.82
yet             0.81
anybody         0.78
anywhere        0.78
acea            0.75
slightest       0.73
Activations Density: 0.721%