Explanations
questions or statements posing a contrary position
phrases that express negation or doubt
Negative Logits
Token        Logit
furt         -0.77
ãĤ£          -0.71
Ö¼           -0.68
Fra          -0.68
urst         -0.66
het          -0.63
ELD          -0.62
eteenth      -0.61
vironment    -0.60
facts        -0.60
Positive Logits
Token        Logit
hin          0.85
?".          0.85
sooner       0.83
?!"          0.79
?",          0.79
!?"          0.76
?]           0.76
?"           0.75
?).          0.75
?),          0.74
Activations Density: 0.098%
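
For a sense of scale, here is a minimal sketch of how an activation density figure like this one could be computed. It assumes density means the fraction of sampled tokens on which the feature has a nonzero activation; the page itself does not state the definition. The function name `activation_density` and the sample data are hypothetical and chosen only to reproduce the 0.098% figure.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature
    fires at all. The dashboard does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 98 of 100,000 tokens.
acts = [0.0] * 99_902 + [1.5] * 98
density = activation_density(acts)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.098% corresponds to the feature firing on roughly one token in a thousand.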