Explanations
derogatory terms or slurs
Negative Logits
Token     Logit
ÙİØ£      -0.16
ÙİØ³      -0.14
ÏĮÏģ      -0.14
ÑĢоз      -0.14
hear      -0.13
lier      -0.13
pdata     -0.13
ziej      -0.13
Trib      -0.13
fon       -0.13
Positive Logits
Token     Logit
ardin     0.19
assed     0.15
kp        0.15
orde      0.14
anine     0.14
ayscale   0.14
夢        0.14
ngo       0.14
ż         0.13
Whale     0.13
Activations Density: 0.033%