Explanations
terms related to discrimination, authority, products, language, democracy, rights, scientific inquiry, integrity, maturity, benefit, leader, economy, profits, and communication
discussions about social justice issues and systemic inequalities
Negative Logits
ãĤ©        -0.71
Vaugh      -0.65
zens       -0.59
renheit    -0.57
everal     -0.57
Rober      -0.56
elve       -0.56
Alley      -0.56
ãĤ»        -0.55
arthed     -0.54
Positive Logits
doesnt     0.99
)?         0.83
depends    0.80
↵Âł        0.79
implies    0.75
dont       0.75
.--        0.74
(%)        0.74
¶          0.72
etc        0.72
Activations Density: 1.171%