    Explanations

    The neuron fires on tokens that signal rude or insulting language (e.g., insults and offensive words).

    Negative Logits
    Token             Logit
    " tempor"         -0.07
    ".friend"         -0.06
    "[a"              -0.06
    " робота"         -0.06
    (blank)           -0.06
    " conform"        -0.06
    "976"             -0.06
    (blank)           -0.06
    "@["              -0.06
    " Jungle"         -0.06
    Positive Logits
    Token             Logit
    " масс"           0.07
    "orpor"           0.07
    "ственно"         0.06
    " เกม"            0.06
    "elial"           0.06
    " společnosti"    0.06
    " země"           0.06
    " Approximately"  0.06
    "major"           0.06
    (blank)           0.06
    Act Density 0.245%

    No Known Activations