    NEGATIVE LOGITS
    Token        Logit
     Efq         -1.86
     Theſe       -1.84
     myſelf      -1.80
     itſelf      -1.73
     Monfieur    -1.66
     raiſ        -1.62
     Jefus       -1.57
     pleaſure    -1.49
     whoſe       -1.47
     ſeveral     -1.47
    POSITIVE LOGITS
    Token        Logit
    (blank)      1.21
    2            1.04
    (blank)      0.88
    1            0.85
    <eos>        0.83
     I           0.82
    ↵↵           0.81
    ...          0.80
    (blank)      0.79
     P           0.77
    Activation Density: 0.006%

    No Known Activations