INDEX
Explanations
terms related to human rights and humanity
Negative Logits
gf          -0.15
yr          -0.15
lying       -0.15
oned        -0.15
gi          -0.15
INCT        -0.15
hausen      -0.15
yang        -0.15
ional       -0.15
gers        -0.14
Positive Logits
-readable   0.18
ized        0.18
izing       0.18
IGHLIGHT    0.17
ENCHMARK    0.16
male        0.16
pire        0.15
istic       0.15
itarian     0.15
ismatch     0.15
Activations Density: 0.037%