    Negative Logits
        Token             Logit
        " such"           -0.90
        " както"          -0.81
        "łych"            -0.79
        "ujące"           -0.79
        "done"            -0.75
        "called"          -0.75
        "ijn"             -0.75
        " known"          -0.74
        "whatever"        -0.73
        "ТА"              -0.73
    Positive Logits
        Token             Logit
        " literally"      1.66
        " Literally"      1.62
        "literally"       1.52
        " pun"            1.48
        " puns"           1.43
        " буквально"      1.37
        " literal"        1.34
        " literalmente"   1.32
        "Literally"       1.23
        " pardon"         1.17
    Activation Density: 0.035%

    No Known Activations