INDEX
Explanations
terms related to allegory and references to specific ethnicities or identities
Negative Logits
aldi      -0.16
eler      -0.16
mund      -0.15
ler       -0.15
illon     -0.15
atsu      -0.15
stral     -0.15
yonel     -0.15
leigh     -0.15
oler      -0.15

Positive Logits
andro     0.25
querque   0.18
zheimer   0.17
kest      0.17
WAYS      0.17
azar      0.16
igned     0.16
igators   0.16
ameda     0.15
onso      0.15
Activations Density 0.138%