INDEX
Explanations
phrases related to inappropriate or offensive content
expressions related to censorship and controversial content
Negative Logits
patiently  -0.84
staggered  -0.79
waiting    -0.76
waits      -0.75
Luck       -0.74
luck       -0.74
Recovery   -0.74
kefeller   -0.71
ternity    -0.71
cells      -0.71
Positive Logits
objectionable  1.44
pornographic   1.38
derogatory     1.37
blasp          1.34
inciting       1.30
offend         1.30
provocative    1.27
indecent       1.27
disrespectful  1.27
nudity         1.27
Activations Density: 0.757%