INDEX
Explanations
instances where the phrase "don't" is used
expressions of uncertainty or refusal
Negative Logits
ategory        -0.75
afore          -0.73
artney         -0.72
ANG            -0.69
Agency         -0.66
Passage        -0.62
Anim           -0.62
upp            -0.59
Personality    -0.58
Antar          -0.58
Positive Logits
't             1.31
ned            0.89
uts            0.83
ates           0.79
ÃŃ             0.78
anted          0.75
nas            0.71
kie            0.70
na             0.69
itzer          0.69
Activations Density 0.075%