    Explanations

    This neuron responds strongly to the word “helpful” (as in the phrase “a helpful and … response”).

    Negative Logits
    inction                         -0.07
    manifest                        -0.07
    えた                              -0.07
     critique                       -0.07
     variety                        -0.06
    δο                              -0.06
    ữa                              -0.06
    出去                              -0.06
    -expanded                       -0.06
    .ColumnHeadersHeightSizeMode    -0.06
    Positive Logits
    VERN                            0.06
     yards                          0.06
     Shib                           0.06
     वर                             0.06
     řid                            0.06
     ponto                          0.06
     Losing                         0.06
     Osw                            0.06
     Nissan                         0.06
     Leer                           0.06
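    The two lists rank vocabulary tokens by how strongly this neuron presumably pushes their logits down (negative) or up (positive). Below is a minimal sketch of how such lists can be produced. It assumes, without confirmation from this page, that each value is the direct path from the neuron's output weights through the unembedding matrix, and it ignores any final LayerNorm. The helper `top_logit_effects` and the names `w_out` and `W_U` are hypothetical.

```python
import numpy as np


def top_logit_effects(w_out, W_U, vocab, k=10):
    """Rank tokens by a neuron's direct effect on the output logits.

    Hypothetical helper. Assumes the effect is w_out @ W_U and ignores
    any final LayerNorm scaling the real model applies.
    """
    # w_out: (d_model,) output weights of the neuron
    # W_U:   (d_model, d_vocab) unembedding matrix
    effects = np.asarray(w_out) @ np.asarray(W_U)  # shape (d_vocab,)
    order = np.argsort(effects)  # token ids, most negative effect first
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy usage with a 2-dimensional residual stream and a 4-token vocabulary.
neg, pos = top_logit_effects(
    np.array([1.0, 0.0]),
    np.array([[0.5, -0.2, 0.1, -0.7], [3.0, 3.0, 3.0, 3.0]]),
    ["a", "b", "c", "d"],
    k=2,
)
print("negative:", neg)
print("positive:", pos)
```

    Only the dot product with each unembedding column matters here, so the toy neuron's effect on token "d" is -0.7 and on token "a" is 0.5.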
    Activation Density: 0.014%
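    Activation density is taken here to mean the share of token positions on which the neuron fires; 0.014% corresponds to roughly 14 positions per 100,000 tokens. The sketch below assumes a neuron "fires" when its activation is strictly greater than `threshold=0.0`. Both that criterion and the helper name `activation_density` are assumptions, not the dashboard's documented method.

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    Hypothetical helper; the firing criterion (activation > 0) is assumed.
    """
    acts = np.asarray(activations, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size


# 14 firing positions out of 100,000 tokens gives a density of 0.014%.
acts = np.zeros(100_000)
acts[:14] = 1.0
print(f"{activation_density(acts):.3f}%")
```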

    No Known Activations