INDEX
Explanations
references to societal issues and social justice
references to marginalized or affected groups of people
Negative Logits
Token         Logit
kamp          -0.77
ob            -0.72
opoly         -0.70
¨             -0.68
osate         -0.68
onis          -0.66
ointment      -0.65
escription    -0.65
ILY           -0.65
orate         -0.64
Positive Logits
Token         Logit
who           1.17
wishing       1.15
pesky         1.00
entrusted     0.94
whom          0.92
who           0.92
tasked        0.89
fortunate     0.89
interested    0.87
involved      0.87
Activations Density: 0.071%