    Explanations

    The neuron fires on phrases that express disapproval or state that something “is not okay.”

    Negative Logits
     Saying          -0.07
    اير              -0.07
     area            -0.06
     equilibrium     -0.06
     calcul          -0.06
    -specific        -0.06
     there           -0.06
    (not rendered)   -0.06
     showed          -0.06
     triggered       -0.06
    Positive Logits
     phê              0.07
    (Vertex           0.07
     this             0.07
    (';               0.07
    (urls             0.06
     DERP             0.06
     WX               0.06
     음악              0.06
    .setView          0.06
    _Bl               0.06
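    The sketch below shows one common way to produce token lists like the two above. It assumes (the page does not say so) that each token's score is the dot product of the neuron's output weights with that token's column of the unembedding matrix. The function names and the toy weights are hypothetical and are not this neuron's actual parameters.

```python
# Hypothetical sketch: score every vocabulary token by how much a neuron's
# output direction raises or lowers its logit. Assumes score = w_out . W_U[:, j];
# the page does not state how its lists were computed.


def logit_effects(w_out, w_unembed, vocab):
    """Return (token, score) pairs for every token in `vocab`.

    w_out:     d_model floats, the neuron's output weights (assumed input)
    w_unembed: d_model rows, each holding len(vocab) floats (unembedding matrix)
    vocab:     token strings, one per unembedding column
    """
    scores = []
    for j, token in enumerate(vocab):
        score = sum(w_out[i] * w_unembed[i][j] for i in range(len(w_out)))
        scores.append((token, score))
    return scores


def top_and_bottom(scores, k):
    """Split scores into the k most negative and the k most positive tokens."""
    ranked = sorted(scores, key=lambda pair: pair[1])
    negative = ranked[:k]            # most negative first
    positive = ranked[::-1][:k]      # most positive first
    return negative, positive


# Toy, made-up weights (NOT this neuron's real parameters): d_model = 2, vocab of 4.
vocab = [" this", " area", "(Vertex", " Saying"]
w_out = [1.0, -0.5]
w_unembed = [
    [0.05, -0.04, 0.06, -0.07],
    [-0.02, 0.04, -0.02, 0.00],
]

negative, positive = top_and_bottom(logit_effects(w_out, w_unembed, vocab), k=2)
for token, score in negative + positive:
    print(f"{token!r:>10} {score:+.2f}")
```

    With these toy numbers, " Saying" and " area" come out as the two most negative tokens, and "(Vertex" and " this" as the two most positive. Leading spaces are kept because they are part of the token.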
    Activation density: 0.058%
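    Activation density is usually read as the fraction of token positions on which the neuron fires, so 0.058% corresponds to about 58 firing positions per 100,000 tokens. The sketch below is a minimal illustration that assumes "fires" means an activation strictly above 0, because the page does not state its threshold. The function name is hypothetical.

```python
# Hypothetical sketch: activation density = firing positions / total positions.
# The threshold of 0.0 is an assumption; the page does not give one.


def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Toy check: 58 firing positions spread across 100,000 tokens.
acts = [0.0] * 100_000
for i in range(58):
    acts[i * 1000] = 1.0

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.058%
```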

    No Known Activations