Explanations
phrases related to causing harm or damage to others
words or phrases that indicate negative impacts or harm to individuals or communities
Negative Logits
uther     -0.74
alled     -0.74
aut       -0.73
iliary    -0.70
ult       -0.70
ulum      -0.68
au        -0.68
ials      -0.68
Nap       -0.67
ault      -0.67
Positive Logits
hurting      1.11
disadvant    0.98
adolesc      0.92
harming      0.91
undermin     0.85
badly        0.85
horribly     0.83
lehem        0.82
Pwr          0.81
harmed       0.80
Activations Density: 0.010%