INDEX
Explanations
words related to causing harm or pain
terms associated with causing harm or injury
Negative Logits
Ou          -0.86
wagen       -0.84
runner      -0.79
cube        -0.74
cius        -0.73
clinton     -0.65
elling      -0.65
chrom       -0.64
McKenna     -0.64
Monaco      -0.63
Positive Logits
inflicted   1.01
inflic      0.91
wounds      0.88
inflicting  0.87
inflict     0.85
hesda       0.85
havoc       0.85
lehem       0.81
veter       0.80
olon        0.80
Activations Density 0.023%