INDEX
    Explanations

    questions throughout the text

    Negative Logits
    oire        -0.70
    aure        -0.70
     fread      -0.69
    𝙫           -0.68
    a           -0.68
    navbar      -0.67
     Bradley    -0.67
    𝓭           -0.66
    aus         -0.65
    ade         -0.64
    Positive Logits
    %?          1.86
    ?           1.72
    ?!?         1.66
    ؟           1.64
    ’?          1.59
    $?          1.58
    ?}          1.52
    !?          1.52
    ?"          1.50
    1.49
    Act Density 0.138%

    No Known Activations