Explanations
specific words or phrases associated with identities or cultural markers, particularly those related to ethnicity or heritage
Negative Logits
ãģªãĤĭ    -0.17
abant     -0.15
rane      -0.15
reb       -0.14
quito     -0.14
Mug       -0.14
incr      -0.14
Murdoch   -0.13
brook     -0.13
ستاÙĨ     -0.13
Positive Logits
elow      0.15
hood      0.15
à¥Ĥद      0.15
etical    0.15
ëĭ¹       0.15
riba      0.14
amb       0.14
kadar     0.14
fully     0.14
eldon     0.14
Activations Density 0.047%