    Negative Logits
    Token             Logit
    " themſelves"     -0.94
    " Theſe"          -0.92
    " Houſe"          -0.88
    " himſelf"        -0.85
    " itſelf"         -0.85
    " Diſ"            -0.81
    " ſmall"          -0.80
    " myſelf"         -0.79
    "whom"            -0.79
    " houſe"          -0.77
    Positive Logits
    Token             Logit
    "ever"            0.86
    " is"             0.79
    ","               0.66
    " was"            0.63
    "e"               0.62
    " we"             0.60
    " has"            0.59
    "se"              0.57
    " continues"      0.56
    "Begriffsklä"     0.56
    Activation Density: 0.034%

    No Known Activations