INDEX
Explanations
terms related to harm, including physical, emotional, or potential danger
mentions of harm, particularly in relation to various contexts and populations
Negative Logits
umen        -0.67
uren        -0.65
ometer      -0.64
READ        -0.63
enhagen     -0.63
Fan         -0.61
aten        -0.60
Completed   -0.59
filled      -0.59
Base        -0.58
Positive Logits
harm        3.87
harms       2.86
harmed      2.15
Harm        2.06
harm        1.98
harming     1.94
hurt        1.65
damage      1.61
harmful     1.57
endanger    1.53
Activations Density: 0.021%
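
The page does not define the density figure. Assuming it denotes the fraction of sampled token positions on which this feature has a nonzero activation, 0.021% corresponds to roughly 2 activations per 10,000 tokens. A minimal sketch of that calculation follows; the function name `activation_density` and the sample data are hypothetical, not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 token positions, with the feature firing on 2.
sample = [0.0] * 9998 + [1.3, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would be consistent with a feature that fires only on a narrow set of tokens, such as the harm-related ones listed above.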