INDEX
    Explanations

    linear regression/predictor

    Negative Logits
    Token         Logit
    "ktCap"       0.58
    " Каждый"     0.53
    "instrList"   0.53
    "aceous"      0.53
    "upaten"      0.52
    "🔯"          0.52
    "🚇"          0.52
    "ھر"          0.52
    "erzo"        0.52
    "aktadır"     0.51
    Positive Logits
    Token            Logit
    " the"           0.60
    " a"             0.57
    " an"            0.51
    (whitespace)     0.49
    " shame"         0.48
    " hatred"        0.48
    " what"          0.47
    " बा"            0.47
    " humiliation"   0.47
    "ওয়"            0.46
    Act Density 0.000%

    No Known Activations