INDEX
Explanations
oppression, discrimination, exploitation
Negative Logits
ية        0.57
ativos    0.52
После     0.52
Q         0.50
O         0.49
エ        0.49
árvore    0.48
D         0.48
N         0.48
after     0.47
Positive Logits
oppression        0.70
oppressed         0.61
coercive          0.60
harassment        0.58
restrictive       0.56
abuse             0.56
discrimination    0.56
misuse            0.55
discriminatory    0.55
oppressive        0.54
Activations Density 0.443%