INDEX
Explanations
instances of authorship or attribution in the text
Negative Logits
arer      -0.16
gang      -0.15
fram      -0.14
'])?      -0.14
heat      -0.14
Heat      -0.14
iffer     -0.13
ecies     -0.13
eskort    -0.13
Heat      -0.13
Positive Logits
antro     0.16
Počet     0.15
stag      0.15
AGR       0.15
λα        0.14
αγ        0.14
445       0.13
RIORITY   0.13
λά        0.13
чен       0.13
Activations Density 0.032%