Explanations
occurrences of the word "The" and its variations, indicating a focus on specific articles
"The" followed by titles
Negative Logits
Token             Logit
aarrggbb          -1.09
SequentialGroup   -0.96
featureID         -0.95
<unused14>        -0.91
<unused8>         -0.91
[@BOS@]           -0.91
<unused41>        -0.91
<unused43>        -0.91
<unused28>        -0.91
<unused3>         -0.91
Positive Logits
Token             Logit
The               1.30
The               1.21
THE               1.02
THE               0.98
the               0.89
ethe              0.66
ザ                0.60
ザ                0.52
sthe              0.50
↵                 0.49
Activations Density 0.168%
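
As a rough illustration of how a density figure like the one above could be computed, the sketch below assumes that "activations density" means the fraction of sampled token positions on which the feature fires (activation above zero). That definition, the function name `activation_density`, and the synthetic sample are all assumptions made for illustration; they are not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature is active; the page itself does not define the term.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Synthetic sample (not real data): 100,000 token positions, 168 of which fire.
sample = [0.0] * 99_832 + [2.5] * 168
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.168% would mean the feature is active on roughly 17 of every 10,000 sampled tokens.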