Explanations
Mentions of racism or other harmful or discriminatory content, and policy-style refusals explaining why hateful content cannot be provided.
Negative Logits
Unexpected    0.87
Graphics      0.83
الماء    0.79
Unexpected    0.78
InnoDB        0.78
ከናወ    0.78
Graphics      0.77
Physics       0.76
Acrobat       0.76
शॉट    0.76
Positive Logits
perpetuated      1.83
patriarchal      1.76
dehuman          1.75
perpetuate       1.73
misog            1.71
authoritarian    1.71
perpet           1.71
capitalism       1.69
totalitarian     1.67
imperialism      1.65
Activations Density 1.108%
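
The density figure above can be read as the share of tokens on which the feature fires. The sketch below assumes that definition (activation strictly greater than zero counts as firing); the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens with a non-zero
    activation; the dashboard may use a different convention.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 1000 tokens, of which 11 activate the feature.
sample = [0.0] * 989 + [0.5] * 11
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 1.108% would mean the feature is active on roughly 11 of every 1000 tokens in the sampled text.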