Explanations
references to societal concepts, structures, and issues
Negative Logits
Token      Logit
orie       -0.19
otty       -0.15
oge        -0.15
andes      -0.15
oria       -0.14
elow       -0.14
иров       -0.14
Downing    -0.14
ysis       -0.14
dater      -0.14
Positive Logits
Token        Logit
-wide        0.32
wide         0.26
/community   0.21
/world       0.19
wide         0.19
Wide         0.17
norms        0.16
/media       0.16
/system      0.15
hood         0.15
Activations Density 0.034%