    Explanations

    The neuron's activations spike on the adjective “new,” indicating that it detects uses of the word “new.”

    Negative Logits
    Token            Logit
    " sofas"         -0.08
    " امر"           -0.07
    "(face"          -0.07
    "vrd"            -0.07
    "against"        -0.07
    "basket"         -0.07
    "lerinden"       -0.07
    " against"       -0.07
    " troubled"      -0.07
    " Sitting"       -0.07
    Positive Logits
    Token            Logit
    " nev"           0.06
    " závod"         0.06
    " bev"           0.06
    "LOUR"           0.06
    " Shib"          0.06
    " Northwest"     0.05
    "_UINT"          0.05
    "_currency"      0.05
    " Claudia"       0.05
    " conditioned"   0.05
    Activation Density: 0.026%

    No Known Activations