INDEX
Explanations
words related to causing harm or negative consequences
mentions of harm
Negative Logits
ARCH      -0.70
liner     -0.65
Pione     -0.64
handy     -0.63
ipel      -0.63
Seasons   -0.60
ourn      -0.60
Jer       -0.59
uncture   -0.59
arity     -0.59
Positive Logits
harm      1.31
onies     1.25
lessly    1.20
harms     1.06
lessness  0.92
harming   0.86
Harm      0.85
endanger  0.83
harmed    0.83
espie     0.81
Activations Density: 0.008%