INDEX
Explanations
self-described or proclaimed identities or affiliations
phrases that refer to self-identifying labels or descriptors
Negative Logits
Shoes         -0.81
perature      -0.81
isson         -0.74
utton         -0.73
vertisement   -0.72
inson         -0.71
von           -0.71
reau          -0.71
elight        -0.71
orrow         -0.71
Positive Logits
adherent      0.88
believer      0.81
caliphate     0.80
atheist       0.76
millennial    0.75
badass        0.72
bigot         0.70
democratic    0.70
socialist     0.70
pacif         0.69
Activations Density 0.060%