    Explanations

    hate speech and abuse refusal

    Negative Logits
    " somewhat"  0.48
    " slightly"  0.48
    " কিছুটা"  0.45
    " dependiendo"  0.44
    " 약간"  0.44
    " pleasantly"  0.43
    " depending"  0.43
    " mainly"  0.42
    "少し"  0.41
    " optional"  0.40
    Positive Logits
    " Даже"  0.51
    " Needless"  0.47
    " مهما"  0.46
    "ऐसे"  0.44
    " навіть"  0.44
    " Even"  0.43
    "Needless"  0.43
    "regardless"  0.43
    "គ្មាន"  0.42
    "rocities"  0.41
    Activation Density: 0.500%

    No Known Activations