INDEX
Explanations
references to specific species or biological classifications
specific words followed by another word
NEGATIVE LOGITS
AddTagHelper    -1.24
queſta          -1.15
<unused41>      -1.13
featureID       -1.13
<unused43>      -1.13
<pad>           -1.13
<unused17>      -1.12
<unused23>      -1.12
<unused8>       -1.12
[@BOS@]         -1.12
POSITIVE LOGITS
The             0.75
↵               0.71
I               0.64
You             0.63
(blank)         0.63
1               0.62
I               0.60
'               0.60
It              0.60
We              0.59
Activations Density 0.000%