INDEX
Explanations
terms related to different categories such as race, class, gender, and other characteristics within a social context
references to social categorizations and roles
Negative Logits
urtles              -0.76
DonaldTrump         -0.75
Newsletter          -0.67
bledon              -0.63
ÂŃ                  -0.61
Remem               -0.59
Salt                -0.59
isSpecialOrderable  -0.56
Send                -0.55
displayText         -0.55
Positive Logits
/,                  1.73
/                   1.68
/)                  1.55
/"                  1.53
/?                  1.49
/_                  1.46
/(                  1.43
/#                  1.42
/.                  1.40
combo               1.32
Activations Density: 0.130%