    Explanations

    whether

    Negative Logits
     skoro        -0.08
     Trom         -0.07
     sna          -0.07
    .Bool         -0.07
     dre          -0.06
     Ens          -0.06
    abad          -0.06
     Coord        -0.06
    ')")↵         -0.06
    Bring         -0.06

    Positive Logits
     proportion    0.07
    score          0.07
     Colleg        0.06
    125            0.06
     immoral       0.06
    .LAZY          0.06
     conscience    0.06
    чат            0.06
     innoc         0.06
    Comparator     0.06

    Act Density 0.000%

    No Known Activations