INDEX
    Explanations

    The neuron activates on occurrences of the words “response” or “reply.”

    Negative Logits
    Token        Logit
    "15"         -0.08
    " lit"       -0.07
    "75"         -0.07
    "275"        -0.07
    "14"         -0.07
    "11"         -0.07
    "50"         -0.07
    " Meat"      -0.07
    " beaten"    -0.07
    " Hall"      -0.07
    Positive Logits
    Token            Logit
    " response"      0.13
    " Response"      0.10
    "responsive"     0.10
    " responses"     0.10
    "response"       0.10
    "-response"      0.10
    " responsive"    0.09
    "Responses"      0.09
    "Response"       0.09
    " респ"          0.09
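
The two logit lists above read like the extremes of a per-token score for this neuron. As a rough sketch of how such lists could be pulled out, assume a mapping from token string to score is already available. The name `top_logits` and the dictionary `logit_effect` are hypothetical; the scores are copied from the lists above, and leading spaces are part of the tokens.

```python
def top_logits(logit_effect, k=3):
    """Return (most_negative, most_positive) lists of (token, score) pairs."""
    # Sort ascending by score: most negative first, most positive last.
    ranked = sorted(logit_effect.items(), key=lambda kv: kv[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_negative, most_positive


# Hypothetical input built from a few entries of the lists above.
logit_effect = {
    " response": 0.13,
    " Response": 0.10,
    " респ": 0.09,
    "15": -0.08,
    " lit": -0.07,
}

neg, pos = top_logits(logit_effect, k=2)
print("negative:", neg)
print("positive:", pos)
```

Keeping the tokens as exact strings matters here, since " response" and "response" are distinct vocabulary entries with different scores.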
    Act Density 0.083%
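
One common reading of activation density is the fraction of sampled token positions on which the neuron fires at all. The sketch below works under that assumption; the helper `activation_density`, the zero threshold, and the sample values are all hypothetical and not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: one firing token among 1,200 positions.
sample = [0.0] * 1199 + [0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.083%
```

A density this low would mean the neuron is silent on the vast majority of tokens, which fits a neuron tied to one narrow word family.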

    No Known Activations