Explanations
mentions of authorship and attribution in text
Negative Logits
otts        -0.17
ź           -0.16
alin        -0.15
uren        -0.14
ailing      -0.14
l           -0.14
icz         -0.14
imson       -0.14
ze          -0.14
_FACTORY    -0.14
Positive Logits
erm         0.16
yne         0.15
ikip        0.15
dda         0.14
undermin    0.14
meldung     0.14
shima       0.14
ãĤ¦ãĤ£      0.14
é¨          0.14
.fhir       0.14
Activations Density 0.037%
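
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of sampled token positions on which the feature's activation is above zero; the page itself does not define the metric, and the function name `activation_density` and the `threshold` parameter are hypothetical. Under that assumption, 0.037% corresponds to roughly 37 active positions per 100,000 tokens.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption (not stated on the page): a position counts as active when
    its activation is strictly greater than the threshold.
    """
    if len(activations) == 0:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # Hypothetical sample: 37 active positions out of 100,000.
    sample = [0.0] * 99_963 + [1.0] * 37
    print(f"{activation_density(sample):.3%}")
```

A feature this sparse fires on very few tokens, so its top-activating examples tend to be more informative than aggregate statistics.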