INDEX
    Explanations

    phrases connecting one thing to a larger group or category, often emphasizing the diversity or quantity of examples

    phrases indicating examples or instances

    Negative Logits
    same            -0.87
    ioned           -0.75
    alos            -0.70
    cffff           -0.69
    iol             -0.67
    then            -0.66
    nob             -0.65
    awaru           -0.64
    never           -0.64
    shared          -0.64

    Positive Logits
     examples        1.07
     example         1.01
     scratching      1.01
     sampling        0.98
     iceberg         0.90
     sample          0.90
     illustration    0.86
     symptom         0.86
     glimpse         0.82
     anecdotal       0.82
    Act Density 0.129%

    No Known Activations