Explanations
Punctuation marks and quotation marks, suggesting a focus on dialogue or spoken expressions in the text.
Negative Logits
drawn        -0.18
taken        -0.17
driven       -0.17
risen        -0.15
eaten        -0.15
itas         -0.15
undertaken   -0.15
given        -0.15
flown        -0.15
arisen       -0.14
Positive Logits
didn         0.36
couldn       0.32
had          0.30
went         0.29
Didn         0.28
felt         0.27
took         0.27
did          0.27
fell         0.27
forgot       0.27
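Dashboards of this kind typically build these two lists by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most negative and most positive entries (the feature's direct effect on each token's logit). The sketch below illustrates that idea under stated assumptions: `top_logit_effects`, `w_dec`, and `W_U` are hypothetical names, and the weights are toy values, not the real model's.

```python
import numpy as np

def top_logit_effects(w_dec, W_U, vocab, k=10):
    """Return the k most negative and k most positive tokens by direct logit effect.

    w_dec: (d_model,) decoder direction of one feature (hypothetical).
    W_U:   (d_model, vocab_size) unembedding matrix (hypothetical).
    """
    effects = w_dec @ W_U        # per-token logit change per unit of feature activation
    order = np.argsort(effects)  # indices sorted from most negative to most positive
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy weights chosen by hand so the effects are easy to read: [0.3, -0.2, 0.1]
vocab = ["didn", "drawn", "had"]
w_dec = np.array([1.0, 0.0])
W_U = np.array([[0.3, -0.2, 0.1],
                [5.0, 5.0, 5.0]])
neg, pos = top_logit_effects(w_dec, W_U, vocab, k=1)
print("negative:", neg)
print("positive:", pos)
```

Sorting once with `np.argsort` and reading both ends avoids computing two separate top-k selections; for a real vocabulary of tens of thousands of tokens, `np.argpartition` would be a cheaper alternative.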
Activation Density: 0.054%
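Activation density is commonly reported as the share of sampled tokens on which the feature's activation is above zero, so 0.054% corresponds to roughly 5 tokens in every 10,000. A minimal sketch, assuming the activations are available as a NumPy array and that "active" means strictly greater than a `threshold` of 0 (both assumptions; the array below is synthetic):

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`."""
    acts = np.asarray(acts)
    return float(np.mean(acts > threshold))

# Synthetic activations: 54 non-zero values among 100,000 tokens
acts = np.zeros(100_000)
acts[:54] = 1.0
density = activation_density(acts)
print(f"Activation density: {density:.3%}")  # prints "Activation density: 0.054%"
```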