INDEX
Explanations
phrases related to race or racial issues
terms related to religion and racial identity
Negative Logits
irable       -0.71
reality      -0.71
agonist      -0.69
iffe         -0.68
roud         -0.67
osponsors    -0.67
agonists     -0.65
exhib        -0.65
challeng     -0.63
ieg          -0.63
Positive Logits
religiously   2.59
racially      1.05
sling         0.71
slack         0.65
Tus           0.62
rompt         0.60
Ins           0.58
nw            0.58
onne          0.58
atically      0.57
Activations Density 0.011%