INDEX
    Explanations

    attacks/abuse

    This neuron detects personal-attack language, i.e. tokens related to insults or abuse.

    Negative Logits
    Token            Logit
    แล               -0.08
    tester           -0.07
    deductions       -0.06
    template         -0.06
    (not shown)      -0.06
    -independent     -0.06
    даних            -0.06
    (not shown)      -0.06
    سازمان           -0.06
    lässt            -0.06

    Positive Logits
    Token            Logit
    atial            0.07
    "."              0.06
    조선             0.06
    Khi              0.06
    (not shown)      0.06
    ursion           0.06
    ॉस               0.06
    amacare          0.06
    :Int             0.05
    bate             0.05

    Activation Density 0.012%

    No Known Activations