INDEX
Explanations
references to blood or violence
references to violent or graphic imagery
Negative Logits
PLIED       -0.85
Reviewer    -0.81
Demand      -0.81
BOOK        -0.80
rador       -0.77
Rate        -0.76
anol        -0.74
YL          -0.73
agall       -0.73
Recomm      -0.70
Positive Logits
bloody      0.83
noses       0.82
wounds      0.76
soever      0.73
streak      0.72
ãĥ£         0.71
bast        0.71
heart       0.70
diarrhea    0.70
swath       0.69
Activations Density 0.019%
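
As a rough illustration of how a density figure like the one above could be read, here is a minimal sketch. It assumes that "density" means the fraction of sampled token positions on which the feature's activation is above zero; the page itself does not define the term. The function name `activation_density`, the `threshold` parameter, and the sample counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (number of active positions) / (total positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 19 active positions out of 100,000 tokens.
sample = [1.0] * 19 + [0.0] * 99_981
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.019% would correspond to the feature firing on roughly 19 of every 100,000 tokens.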