Explanations
words or phrases indicating a decision or a conclusion
repeated phrases implying a conclusion or result
Negative Logits
Token       Logit
ulton       -0.68
archives    -0.67
inen        -0.66
>>          -0.65
tein        -0.63
NYT         -0.61
ellen       -0.61
/-          -0.61
reads       -0.60
=]          -0.60
Positive Logits
Token       Logit
stairs      0.95
river       0.90
graded      0.89
stairs      0.87
grading     0.86
sidx        0.82
vote        0.76
redes       0.74
grades      0.73
WARD        0.66
Activations Density 0.025%