INDEX
NEGATIVE LOGITS
pleasantly    -0.09
roomy         -0.08
mentors       -0.08
moder         -0.08
photos        -0.07
mv            -0.07
reš           -0.07
_ball         -0.07
EOS           -0.07
awesome       -0.07
POSITIVE LOGITS
propaganda    0.15
ruthless      0.14
neoliberal    0.14
harmful       0.14
misguided     0.14
blatant       0.14
涉嫌          0.14
奸            0.13
unethical     0.13
malicious     0.13
Activations Density 0.305%