    Explanations

    The neuron primarily detects occurrences of the word “problem.”

    Negative Logits
    Token            Logit
    " engr"          -0.07
    " Kir"           -0.06
    " ent"           -0.06
    " ecs"           -0.06
    " contingent"    -0.06
    " Evans"         -0.06
    " Express"       -0.06
    " courtesy"      -0.06
    ".ke"            -0.06
    " drives"        -0.06
    Positive Logits
    Token            Logit
    " problem"       0.17
    " problems"      0.15
    " Problem"       0.14
    "problem"        0.12
    " Problems"      0.12
    "Problem"        0.11
    "problems"       0.10
    "mma"            0.09
    (blank)          0.08
    " troub"         0.08
    Activation Density: 0.049%

    No Known Activations