Explanations
references to racial and ethnic language or slurs
Negative Logits
Token        Logit
ienne        -0.16
ror          -0.15
ÑĸнÑĮ         -0.15
yre          -0.15
zej          -0.14
semicolon    -0.14
yro          -0.13
102          -0.13
altru        -0.13
ë¬           -0.13
Positive Logits
Token        Logit
epith        0.25
language     0.23
obsc         0.22
curse        0.21
words        0.20
vul          0.20
swear        0.20
coarse       0.20
swearing     0.19
sworn        0.19
Activations Density 0.074%
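
The density figure can be read as the share of tokens on which the feature fires at all. The sketch below illustrates that reading under stated assumptions: that "density" means the fraction of token positions with activation above zero, and that activations are available as a flat list of floats. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard or its tooling.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: density is counted per token position, and a feature
    is considered active when its activation is greater than zero.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data (not from the source): 100,000 token positions,
# 74 of which have a non-zero activation.
sample = [0.0] * 99_926 + [1.5] * 74
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.074% would mean the feature is active on roughly 74 of every 100,000 tokens, which marks it as a sparse feature.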