INDEX
Explanations
language related to swearing and racial slurs
Negative Logits
Token        Logit
orgh         -0.17
hare         -0.14
jezd         -0.14
ÑģÑĤоÑĢ       -0.13
yre          -0.13
957          -0.13
æĭ           -0.13
yearly       -0.13
Wilkinson    -0.13
(blank)      -0.13
Positive Logits
Token        Logit
prof         0.49
swear        0.46
curse        0.44
swearing     0.43
curs         0.40
curses       0.36
Prof         0.36
prof         0.35
obsc         0.34
Curse        0.34
Activations Density 0.164%