INDEX
Explanations
words related to causing harm or injury
instances of the word "hurt."
Negative Logits
Token        Logit
aut          -0.76
uther        -0.75
clerosis     -0.71
arch         -0.69
aer          -0.67
atching      -0.67
vironment    -0.65
gran         -0.65
liner        -0.65
au           -0.65
Positive Logits
Token        Logit
hurt         1.14
hurting      0.90
hurts        0.90
onies        0.87
ful          0.82
lehem        0.81
losers       0.81
igue         0.78
badly        0.77
ting         0.76
Activations Density: 0.008%
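For scale, a density of 0.008% corresponds to roughly 8 active token positions per 100,000. The sketch below illustrates that arithmetic under the assumption that density means the fraction of token positions where the feature's activation exceeds zero; the dashboard does not state its exact definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumed definition for illustration; the source dashboard may
    compute density differently.
    """
    acts = list(activations)
    if not acts:
        return 0.0
    active = sum(1 for a in acts if a > threshold)
    return active / len(acts)


# Hypothetical data: 100,000 token positions, 8 of which activate the feature.
acts = [0.0] * 100_000
for i in range(8):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

A feature this sparse fires on very few tokens, which fits explanations that point to a single word family such as "hurt".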