INDEX
Explanations
parentheses used in the text
opening parentheses in the text
Negative Logits
tears         -0.94
starters      -0.70
Bees          -0.69
highlights    -0.68
oranges       -0.68
counselors    -0.65
windows       -0.64
bows          -0.64
tamp          -0.63
Secrets       -0.63
Positive Logits
albeit        1.58
non           1.35
sic           1.34
yet           1.31
possibly      1.10
rather        1.07
but           1.05
literally     1.03
anti          1.03
sometimes     1.02
Activations Density: 0.075%