    Negative Logits
    Token         Logit
    " нали"       -0.07
    " meld"       -0.07
    "็นท"         -0.06
    "vik"         -0.06
    " Cunning"    -0.06
    " आप"         -0.06
    "ُّ"          -0.06
    "命令"        -0.06
    "jh"          -0.06
    "lya"         -0.06
    Positive Logits
    Token               Logit
    " Admin"            0.07
    " abs"              0.06
    " monet"            0.06
    "filtered"          0.06
    (token not shown)   0.06
    (token not shown)   0.06
    " Honestly"         0.06
    ".sup"              0.06
    " restraint"        0.06
    ".Rotate"           0.06
    Activation Density: 0.004%

    No Known Activations