    Explanations

    The word "one", with activation strength that varies by context

    References to the concept of "one" or to singular items

    Negative Logits
    "osponsors"    -1.14
    "rations"      -0.93
    "pees"         -0.79
    "lations"      -0.79
    "apons"        -0.78
    "ooks"         -0.78
    "etz"          -0.78
    "ourses"       -0.77
    "endars"       -0.77
    "uts"          -0.76

    Positive Logits
    " thing"        1.28
    " caveat"       1.12
    " glaring"      1.12
    " overarching"  1.09
    " overriding"   1.05
    " pecul"        1.01
    " undeniable"   1.00
    " drawback"     0.98
    " exception"    0.96
    " notable"      0.93

    Activation density: 0.092%

    No known activation examples.