INDEX
Explanations
adjectives or nouns related to harm or damage
terms related to the concept of harm or harmfulness
Negative Logits
quart    -0.79
Kings    -0.71
Hun      -0.70
eely     -0.68
ARCH     -0.66
Whe      -0.65
ebus     -0.65
peak     -0.65
TeX      -0.64
gran     -0.64
Positive Logits
harmful         1.15
harm            1.03
undermin        1.01
harms           0.88
endanger        0.85
adolesc         0.85
detrimental     0.85
consequences    0.84
contamin        0.80
harming         0.80
Activations Density: 0.008%