Explanations
human rights violations and abuses
Negative Logits
Token       Value
fury        0.45
愤怒        0.43
hurting     0.42
憤          0.40
harm        0.40
baddies     0.40
traged      0.39
اتھار       0.39
harmed      0.39
adversity   0.39

Positive Logits
Token       Value
arbitrary   0.99
torture     0.89
extra       0.80
tort        0.80
Tort        0.78
Tort        0.76
tort        0.75
Arbit       0.72
ekstra      0.71
executions  0.70

Activations Density: 0.011%