INDEX
Explanations
content related to classification ratings and appropriateness for young audiences
Negative Logits
endi     -0.15
è°±      -0.14
ropol    -0.14
plits    -0.14
ÑĥÑħ     -0.14
inand    -0.14
itra     -0.14
ragen    -0.13
Roch     -0.13
ortex    -0.13
Positive Logits
violence    0.31
Violence    0.28
Viol        0.25
viol        0.24
violent     0.23
content     0.22
Viol        0.22
adult       0.21
-viol       0.21
viol        0.20
Activations Density 0.190%