INDEX
Explanations
phrases related to making threats against others
phrases related to threats and the intent to cause harm
Negative Logits
Parables     -0.75
cellent      -0.73
emis         -0.72
learners     -0.71
fortable     -0.70
ortunate     -0.69
eret         -0.69
erning       -0.69
kered        -0.67
admirable    -0.67
Positive Logits
blackmail    0.98
boycott      0.97
wrath        0.97
derail       0.97
quit         0.94
veto         0.94
sue          0.94
arrest       0.91
drown        0.90
ruin         0.90
Activations Density: 0.151%