INDEX
Explanations
phrases related to negative social behaviors such as harassment, threats, and discrimination
phrases and concepts related to abuse and social issues
Negative Logits
çīĪ       -0.72
ufact     -0.69
arten     -0.66
ppa       -0.62
illac     -0.61
ainted    -0.59
puter     -0.59
hesion    -0.57
ahon      -0.55
ixel      -0.55
Positive Logits
imprisonment    0.73
retribution     0.70
bullying        0.70
persecution     0.70
degradation     0.68
threats         0.67
boredom         0.67
consequ         0.67
exploitation    0.67
harassment      0.65
Activations Density 0.486%