INDEX
Explanations
negative content
instances of a highly offensive racial slur (the n-word) and similar hateful/derogatory language.
Negative Logits
回复  -0.07
_picture  -0.07
Lie  -0.07
שמים  -0.07
Authorized  -0.06
-ul  -0.06
.Information  -0.06
主力军  -0.06
劑  -0.06
_shot  -0.06
Positive Logits
bindings  0.07
禺  0.07
unsafe  0.06
hấp  0.06
.Include  0.06
param  0.06
prést  0.06
ark  0.06
ASP  0.06
჻  0.06
Activations Density 0.715%