Explanations
verbs or phrases related to causing harm or suffering
words related to causing harm or suffering
Negative Logits
Token       Logit
runner      -0.82
cube        -0.78
cius        -0.77
wagen       -0.75
Ou          -0.71
kj          -0.70
chrom       -0.69
Blackwell   -0.69
mom         -0.68
zo          -0.67
Positive Logits
Token       Logit
inflicted   1.32
inflict     1.04
inflicting  1.03
veter       1.00
inflic      0.97
adolesc     0.96
hesda       0.89
terness     0.88
wounds      0.86
eleph       0.84
Activations Density 0.012%