INDEX
    Explanations
    NEGATIVE LOGITS
    Token            Logit
    "."              -0.72
    " just"          -0.60
    "just"           -0.59
    " Just"          -0.57
    "<eos>"          -0.55
    " saja"          -0.54
    "↵↵"             -0.53
    ":"              -0.53
    (whitespace)     -0.52
    "Just"           -0.52

    POSITIVE LOGITS
    Token            Logit
    " myſelf"        1.28
    " itſelf"        1.18
    " Theſe"         1.11
    " ―――――"         1.06
    " uſed"          1.04
    " themſelves"    1.04
    " faſt"          1.03
    " ſeveral"       1.02
    " anſ"           1.02
    " himſelf"       1.02
    Activation Density: 0.031%

    No Known Activations