INDEX
    Explanations

    the word "way" with varying activation levels

    Negative Logits
    Token         Logit
    "iasco"       -0.73
    "lict"        -0.72
    "uster"       -0.72
    " livest"     -0.70
    "usters"      -0.69
    "uctor"       -0.64
    "iners"       -0.64
    "uates"       -0.63
    " fumes"      -0.62
    "ividual"     -0.62
    Positive Logits
    Token         Logit
    "fare"        1.18
    "ward"        1.18
    "finding"     1.05
    "points"      1.04
    "point"       1.03
    "forward"     0.96
    "bill"        0.93
    "WARD"        0.91
    "Forward"     0.89
    "cross"       0.89
    Act Density 0.029%

    No Known Activations