Explanations
the presence of certain names or identifiers, particularly those related to characters or individuals
Negative Logits
es       -0.23
er       -0.21
esin     -0.20
ekk      -0.19
zman     -0.19
eson     -0.17
erse     -0.16
ed       -0.16
esini    -0.16
eyer     -0.16
Positive Logits
zi        0.30
y         0.28
quierda   0.23
ionario   0.21
ze        0.20
zen       0.19
riz       0.18
abella    0.18
zych      0.17
ibilit    0.17
Activations Density: 0.019%
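
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that density means the fraction of tokens on which the feature's activation is above zero; the page does not state this definition. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 19 of 100,000 tokens.
toy = [0.0] * 99_981 + [1.0] * 19
print(f"{activation_density(toy):.3%}")  # 0.019%
```

Under that assumption, a density of 0.019% corresponds to roughly one active token in every 5,000.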