INDEX
Explanations
instances of strong language or racial and ethnic slurs
NEGATIVE LOGITS
Token            Logit
orgh             -0.18
Wilkinson        -0.14
rix              -0.14
istringstream    -0.14
hare             -0.14
850              -0.14
957              -0.13
jezd             -0.13
ikki             -0.13
ores             -0.13
POSITIVE LOGITS
Token            Logit
prof             0.49
curse            0.44
swear            0.41
swearing         0.40
curs             0.39
curses           0.36
Prof             0.36
prof             0.34
Curse            0.34
obsc             0.34
Activations Density 0.135%