INDEX
Explanations
unethical, harmful, disrespectful, unprofessional
Negative Logits
`>=    0.40
fonts    0.40
Fonts    0.39
সতর্ক    0.39
সতর্কতা    0.39
Sweet    0.38
苦手    0.38
성능    0.38
Sweet    0.38
Performance    0.38
POSITIVE LOGITS
disrespectful    0.57
affront    0.52
tantamount    0.49
irresponsible    0.48
是一种    0.46
insulting    0.45
insult    0.45
would    0.45
taman    0.44
shameful    0.44
Activations Density 0.074%