INDEX
Explanations
quotes enclosed in quotation marks
Negative Logits
honors        -0.82
favor         -0.82
honor         -0.79
nude          -0.74
bunk          -0.74
clo           -0.73
eligible      -0.72
classified    -0.71
slam          -0.71
grades        -0.71
Positive Logits
Therefore     1.49
It            1.45
However       1.43
They          1.42
We            1.40
Whereas       1.39
There         1.38
But           1.36
Secondly      1.35
If            1.34
Activations Density 0.091%