INDEX
Explanations
words related to negative behaviors or consequences such as hatred, prejudice, harassment, violence, and rejection
references to systemic discrimination and social injustices
Negative Logits
ufact     -0.72
çīĪ       -0.66
eus       -0.66
arten     -0.63
cko       -0.62
eport     -0.61
ppa       -0.60
utra      -0.58
arers     -0.58
illac     -0.57
Positive Logits
imprisonment      0.70
unwanted          0.70
harassment        0.66
boredom           0.65
threats           0.65
vous              0.65
loneliness        0.63
deaths            0.63
bullying          0.62
misinformation    0.62
Activations Density 0.474%
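
The density figure above is not defined in this listing. A common reading, assumed here, is the percentage of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all, expressed as a percentage.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 1000 tokens, 5 of which activate the feature.
sample = [0.0] * 995 + [0.3, 1.2, 0.7, 2.5, 0.1]
print(f"{activation_density(sample):.3f}%")  # prints "0.500%"
```

Under this reading, a density of 0.474% would mean the feature fires on roughly 1 token in 211.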