    Explanations

    The neuron fires on words that describe tricks, traps, or schemes (e.g. “setup,” “trap,” “swindling,” “blackmail”).

    Negative Logits
     Lincoln        -0.07
     entertaining   -0.06
    _OWNER          -0.06
     Duty           -0.06
     July           -0.06
     دم             -0.06
    _hat            -0.06
     Coral          -0.06
     Shelley        -0.06
    िब              -0.06
    Positive Logits
     norske         0.08
    ικές            0.07
    рогра           0.06
    ість            0.06
                    0.06
    ussed           0.06
    Operator        0.06
     siding         0.06
     определить     0.06
    ejména          0.06
    Activation Density: 0.075%

    No Known Activations