NEGATIVE LOGITS
অবশ্য       -0.08
ustra       -0.08
andid       -0.07
ilas        -0.07
meaningful  -0.07
vlak        -0.07
оглас       -0.07
ұ           -0.07
认可        -0.07
인정        -0.07
POSITIVE LOGITS
safer       0.33
safest      0.31
safe        0.29
Safe        0.25
Safe        0.25
err         0.25
safe        0.25
safety      0.24
-safe       0.23
cautious    0.23
Activations Density 0.067%