Explanations
expressions related to nationalism and identity
Negative Logits
enko        -0.14
Tier        -0.14
eree        -0.14
feminist    -0.14
ltre        -0.14
Bias        -0.14
kova        -0.14
Femin       -0.14
ubu         -0.14
earer       -0.13
Positive Logits
identity      0.36
identity      0.34
Identity      0.32
Identity      0.29
-national     0.29
nationalism   0.27
national      0.27
national      0.27
identities    0.26
nation        0.26
Activations Density 0.127%