Explanations
references to nationalities or ethnic identities
Negative Logits
Token         Logit
nda           -0.17
own           -0.16
706           -0.16
omor          -0.15
awe           -0.15
/trunk        -0.14
Dra           -0.14
ss            -0.14
atar          -0.14
iece          -0.14
Positive Logits
Token         Logit
ertz          0.17
ışık          0.15
bench         0.15
.Generated    0.15
ห             0.15
ppe           0.14
ERSHEY        0.14
ezi           0.14
ooth          0.14
мест          0.14
Activations Density: 0.147%