Explanations
phrases related to intense negative actions or situations
descriptors that convey extreme negativity or violence
Negative Logits
sembly     -0.71
Bundle     -0.70
ourced     -0.70
cession    -0.69
ools       -0.69
ITNESS     -0.69
ploma      -0.69
FU         -0.67
OPLE       -0.66
pty        -0.66
Positive Logits
retribution    0.98
merciless      0.97
punishments    0.94
honesty        0.92
retaliation    0.92
repression     0.90
Slaughter      0.88
punishment     0.87
assault        0.87
unfor          0.87
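Lists like the two above are usually read as the tokens a feature most suppresses and most promotes. The page does not say how the values were computed, so the following is only a minimal sketch: it assumes the per-token logit effects are already available as a mapping, and it ranks them into the two lists. The function name `rank_logits` and the `sample` dictionary are hypothetical; the numbers in `sample` are copied from the lists above.

```python
from typing import Dict, List, Tuple

TokenScore = Tuple[str, float]


def rank_logits(effects: Dict[str, float], k: int = 10) -> Tuple[List[TokenScore], List[TokenScore]]:
    """Return (most_negative, most_positive), each holding at most k entries.

    `effects` maps a token string to its logit effect. How those effects are
    obtained is outside this sketch and is an assumption of the example.
    """
    ordered = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by value
    most_negative = ordered[:k]
    most_positive = ordered[::-1][:k]  # descending by value
    return most_negative, most_positive


# A handful of entries copied from the lists above (not the full vocabulary).
sample = {
    "retribution": 0.98,
    "merciless": 0.97,
    "punishments": 0.94,
    "sembly": -0.71,
    "Bundle": -0.70,
    "pty": -0.66,
}

negative, positive = rank_logits(sample, k=2)
print("negative:", negative)
print("positive:", positive)
```

With `k=2` this keeps only the two most extreme tokens on each side; the dashboard lists ten per side.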
Activations Density: 0.108%
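The page does not define how the density figure is measured. One common reading is the fraction of sampled tokens on which the feature has a nonzero activation; the sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and are not meant to reproduce the 0.108% figure.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as active when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: the feature fires on 1 token out of 1000.
acts = [0.0] * 999 + [2.5]
density = activation_density(acts)
print(f"{density:.3%}")
```

A density this low would mean the feature is sparse, firing on only a small share of tokens.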