INDEX
Explanations
content related to harmful or offensive behavior and violations of guidelines
Negative Logits
uble     -0.18
rego     -0.16
ocular   -0.14
лоп      -0.14
ution    -0.14
alie     -0.14
765      -0.14
ogle     -0.14
yssey    -0.14
ogany    -0.14
Positive Logits
offensive   0.20
Offensive   0.16
nudity      0.15
invasion    0.15
Ùħبر        0.15
esin        0.15
_again      0.15
addle       0.15
Content     0.14
Heath       0.14
Activations Density 0.050%