    Explanations

    occurrences of the letter "w"

    Negative Logits
    " Theſe"         -1.05
    " ་་"            -1.02
    " Monfieur"      -0.97
    " ―――――"         -0.95
    " iſt"           -0.95
    " myſelf"        -0.91
    " Beſ"           -0.90
    " themſelves"    -0.89
    " ſeveral"       -0.86
    " verſ"          -0.84
    Positive Logits
    " w"             1.99
    " W"             1.90
    "W"              1.70
    "w"              1.62
    " b"             1.11
    " d"             0.95
    " h"             0.94
    " r"             0.93
    "𝙬"              0.93
    " g"             0.93
    Activation Density 0.088%

    No Known Activations