INDEX
    Explanations
    Negative Logits
    " purpoſe"     -1.05
    "ſelves"       -0.85
    " Theſe"       -0.85
    " Houſe"       -0.84
    " Efq"         -0.83
    "Lycka"        -0.83
    " pleaſure"    -0.80
    " NDEBUG"      -0.80
    " fhew"        -0.79
    " Majefty"     -0.79
    Positive Logits
    "."            0.74
    " for"         0.71
    " that"        0.70
    " because"     0.62
    " such"        0.60
    " like"        0.60
    " caused"      0.59
    "!"            0.59
    " called"      0.59
    " known"       0.59
    Activation Density: 0.028%

    No Known Activations