INDEX
Explanations
words related to violence or intense negative experiences
references to violent or distressing imagery
Negative Logits
PLIED      -0.88
Reviewer   -0.88
anol       -0.87
Demand     -0.86
Recommend  -0.78
Rate       -0.77
rador      -0.76
BOOK       -0.75
later      -0.75
CHAT       -0.75
Positive Logits
bloody     0.93
noses      0.83
wounds     0.77
bast       0.77
slaughter  0.74
swath      0.74
prick      0.72
stained    0.72
swat       0.72
blood      0.71
Activations Density: 0.011%