INDEX
Explanations
wrong, unethical, disrespectful, problematic
Negative Logits
pressured  0.71
influenz  0.71
🙂  0.70
ุงเทพ  0.70
Advantage  0.67
risky  0.66
kesulitan  0.66
advantage  0.66
منفی  0.65
보다는  0.63
Positive Logits
abhor  1.33
heinous  1.33
egregious  1.30
abomin  1.29
violation  1.28
affront  1.24
atrocious  1.21
outrage  1.21
disgraceful  1.19
disgrace  1.18
Activations Density 0.380%