Explanations
list of negative attributes
Negative Logits
dedicato        0.49
ᓐ               0.46
ornate          0.46
désigne         0.44
striées         0.43
идеально        0.41
boul            0.40
impressively    0.40
doskona         0.40
parfaitement    0.40
Positive Logits
immoral         0.67
injustice       0.63
mismanagement   0.58
irresponsible   0.57
व्यवहार          0.55
कायद्या          0.55
hardships       0.55
unethical       0.55
selfishness     0.55
injustices      0.55
Activations Density: 0.001%