INDEX
    Explanations

    This neuron responds to words and phrases describing negative consequences, risks, or harmful outcomes.

    Negative Logits
    " das"          -0.07
    " virtual"      -0.06
    "--------------------------------------------------------------------------------"  -0.06
    "Your"          -0.06
    " prevention"   -0.06
    "者の"          -0.06
    "your"          -0.06
    "скую"          -0.06
    "developers"    -0.06
    "هوری"          -0.06
    Positive Logits
    " had"          0.07
    "REFERRED"      0.07
    " have"         0.07
    " DST"          0.06
    " Perf"         0.06
    " Courier"      0.06
    " directed"     0.06
    " Mitt"         0.06
    " Has"          0.06
    "held"          0.06
    Activation Density: 0.072%

    No Known Activations