INDEX
Explanations
terms related to institutions and historical figures associated with oppression or inequality
Negative Logits
================================================================  -0.18
../../../  -0.17
.Unicode  -0.15
numer  -0.15
Æł  -0.14
stroy  -0.14
áŀ¶  -0.13
ancybox  -0.13
íĴĪ  -0.13
ents  -0.13
Positive Logits
erman  0.18
itan  0.17
ier  0.17
/he  0.17
berg  0.15
-water  0.15
efeller  0.15
ermann  0.15
lear  0.15
DTD  0.15
Activations Density 0.767%