Explanations
instances of authorship or attribution in text
Negative Logits
Token          Logit
ContentLoaded  -0.16
oras           -0.16
nelly          -0.15
abay           -0.15
isto           -0.15
_COPY          -0.14
cargo          -0.14
loud           -0.14
/document      -0.14
mada           -0.14
Positive Logits
Token          Logit
pto            0.20
means          0.20
rne            0.18
laws           0.17
virtue         0.17
gone           0.17
ÅĤa            0.16
hra            0.16
gg             0.15
dint           0.15
Activations Density: 0.149%