INDEX
Explanations
language related to activism and social justice
Negative Logits
Token     Logit
shave     -0.15
Leakage   -0.14
rophy     -0.14
essim     -0.14
ossal     -0.14
anship    -0.13
Risk      -0.13
风险      -0.13
azar      -0.13
ennen     -0.12
Positive Logits
Token      Logit
equality   0.34
equal      0.34
justice    0.34
equity     0.31
rights     0.30
fairness   0.28
fair       0.28
liberties  0.25
human      0.25
equal      0.25
Activations Density 0.191%
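
The density figure above is not defined on this page. One common reading is the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only; the function name, the zero threshold, and the sample data are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature is
    active. The dashboard's actual definition may differ.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 2 of which fire.
sample = [0.0] * 998 + [1.7, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.200%
```

Under this reading, a density of 0.191% would mean the feature is active on roughly 2 of every 1000 sampled tokens.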