Explanations
"bad" followed by negative descriptions
Negative Logits
مزید    0.41
ಸರ್    0.41
Sil    0.41
ベージュ    0.39
もう    0.38
エ    0.38
Sip    0.38
ochrony    0.37
سر    0.36
производителя    0.36
Positive Logits
bad    0.66
坏    0.66
BAD    0.60
Bad    0.59
খারাপ    0.57
stereotypes    0.57
बद    0.57
minton    0.55
nesses    0.54
kötü    0.53
Activations Density 0.043%