Explanations
references to diversity and its various dimensions, including cultural, ideological, and biological aspects
Head Attr Weights
0:0.01
1:0.01
2:0.09
3:0.05
4:0.08
5:0.02
6:0.06
7:0.40
8:0.02
9:0.02
10:0.10
11:0.10
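As a rough illustration of how the weights above might be read, the sketch below ranks the heads by weight and reports the top head's share of the listed total. It assumes each `index:weight` entry is one attention head's attribution weight; the names `head_weights` and `rank_heads` are hypothetical and not part of the dashboard.

```python
# Head attribution weights copied from the list above (head index -> weight).
# Assumption: each entry is one attention head's attribution weight.
head_weights = {
    0: 0.01, 1: 0.01, 2: 0.09, 3: 0.05, 4: 0.08, 5: 0.02,
    6: 0.06, 7: 0.40, 8: 0.02, 9: 0.02, 10: 0.10, 11: 0.10,
}


def rank_heads(weights):
    """Return (head, weight) pairs sorted from largest to smallest weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_heads(head_weights)
top_head, top_weight = ranked[0]

# The listed weights are rounded, so their sum need not be exactly 1.
total = sum(head_weights.values())
top_share = top_weight / total

print(f"top head: {top_head} (weight {top_weight:.2f}, share {top_share:.1%})")
```

Under this reading, head 7 carries the largest weight (0.40), well ahead of heads 10 and 11 (0.10 each).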
Negative Logits
urat       -1.47
pir        -1.45
aunder     -1.40
money      -1.38
¢          -1.33
monitor    -1.33
WARD       -1.31
raz        -1.31
�          -1.31
grand      -1.30
Positive Logits
sexes           1.83
genders         1.82
demographics    1.66
opinion         1.63
opinions        1.54
geographically  1.51
Diversity       1.51
individuality   1.51
erning          1.46
personality     1.43
Activations Density 0.009%