INDEX
    Explanations
    Negative Logits
     অবশ্য        -0.08
    ustra         -0.08
    andid         -0.07
    ilas          -0.07
     meaningful   -0.07
    vlak          -0.07
    оглас         -0.07
    ұ             -0.07
    认可            -0.07
     인정           -0.07
    Positive Logits
     safer        0.33
     safest       0.31
     safe         0.29
     Safe         0.25
    Safe          0.25
     err          0.25
    safe          0.25
     safety       0.24
    -safe         0.23
     cautious     0.23
    Activation Density: 0.067%

    No Known Activations