INDEX
NEGATIVE LOGITS
нельзя     0.69
不能       0.62
不允许     0.61
refusing   0.58
cannot     0.57
refusal    0.57
запре      0.57
prohibits  0.56
refus      0.56
싫         0.56
POSITIVE LOGITS
远离       0.46
steer      0.46
DO         0.45
ste        0.44
ven        0.44
DO         0.41
ter        0.41
cumin      0.39
itori      0.38
p          0.38
Activations Density 0.034%