INDEX
Explanations
instances of identification and self-reference
Negative Logits
elman     -0.16
uries     -0.16
lund      -0.16
Barrett   -0.15
kowski    -0.14
asta      -0.14
icl       -0.13
ombie     -0.13
anche     -0.13
ANCH      -0.13
Positive Logits
озем      0.15
804       0.15
abor      0.14
olor      0.14
/design   0.14
agnost    0.14
ourse     0.14
witter    0.13
Fen       0.13
cheiden   0.13
Activations Density 0.026%