    Explanations

    the letter 'w' in various contexts, indicating a focus on its frequency in the text

    Negative Logits
    Token     Logit
    ={({      -0.79
    .";       -0.75
    "},       -0.67
    )");      -0.67
    XXXXX     -0.67
    ).)       -0.67
    ).]       -0.66
    (blank)   -0.66
    )');      -0.65
    :~        -0.64
    Positive Logits
    Token   Logit
    w       2.49
     w      2.27
    𝐰       1.06
    𝙬       0.97
    wl      0.94
     wh     0.93
    ww      0.92
    wh      0.90
    𝑤       0.89
    𝒘       0.88
    Activation Density: 0.074%

    No Known Activations