INDEX
Explanations
negative content related to violence, discrimination, and offensive language
references to social issues and intolerance towards various identity groups
Negative Logits
onen          -0.57
Mous          -0.56
noon          -0.52
Dangerous     -0.52
Nich          -0.51
Piper         -0.51
lyak          -0.50
20439         -0.50
Passage       -0.50
Architects    -0.50
Positive Logits
etc           1.20
etc           1.01
…)            0.84
whatever      0.71
ect           0.68
…             0.65
cknow         0.61
cheat         0.60
welf          0.58
…             0.58
Activations Density 0.372%