INDEX
Explanations
references to a nation or national identity
Negative Logits
sse     -0.17
orie    -0.17
Nic     -0.17
ors     -0.17
nice    -0.17
ly      -0.15
lyn     -0.15
ory     -0.15
Nice    -0.15
lett    -0.15
Positive Logits
wide    0.31
hood    0.28
ally    0.27
nal     0.27
alse    0.26
als     0.23
ALSE    0.23
-wide   0.23
ALLY    0.22
ality   0.22
Activations Density 0.016%