INDEX
Explanations
the letter 't' appearing with high activations
instances of the negation "didn't."
Negative Logits
Reduced    -0.64
Compar     -0.61
itiz       -0.61
arsen      -0.60
lining     -0.60
edIn       -0.59
Pop        -0.59
soType     -0.59
Strongh    -0.58
Dise       -0.58
Positive Logits
bother         0.99
hesitate       0.91
necessarily    0.89
hesitated      0.77
exactly        0.75
apest          0.75
hes            0.75
actic          0.75
dare           0.74
ificate        0.74
Activations Density: 0.059%
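
The density figure can be read as the share of token positions on which the feature fires. The page does not say how the number was computed, so the sketch below is only an illustration under that assumption: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens with a nonzero
    activation. The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 59 of them active.
acts = [0.0] * 100_000
for i in range(59):
    acts[i * 1000] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.059% corresponds to roughly 59 active positions per 100,000 tokens, which marks the feature as sparse.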