INDEX
    Explanations

    health dangers

    This neuron detects language describing things as dangerous, harmful, or posing health and safety risks.

    New Auto-Interp
    Negative Logits
     cường
    -0.06
    <label
    -0.06
     sunshine
    -0.06
    ichael
    -0.06
     awkward
    -0.06
     meilleurs
    -0.06
     ctor
    -0.06
     gameplay
    -0.06
     mejores
    -0.06
     Janet
    -0.06
    POSITIVE LOGITS
    ––
    0.08
    0.07
    ()):↵
    0.07
    ):-
    0.07
     sik
    0.07
    0.06
     Vari
    0.06
    _)
    0.06
     %}↵
    0.06
    0.06
    Act Density 0.053%

    No Known Activations