Explanations
specific groups of people based on characteristics or affiliations
phrases related to discrimination and targeted hate speech
Negative Logits
staking       -0.74
thumbnails    -0.74
flex          -0.73
mares         -0.69
flows         -0.69
Alert         -0.67
blocks        -0.67
orders        -0.65
amar          -0.65
ulations      -0.65
Positive Logits
particular    1.23
person        1.10
specific      1.08
subset        1.06
individual    1.05
constituent   1.00
deity         1.00
entity        0.97
piece         0.96
perpetrator   0.95
Activations Density 0.431%