INDEX
    Explanations

    discussions about harmful statements or language that implies threats or violence

    Negative Logits
    " tartalomajánló"   -0.42
    "queryInterface"    -0.41
    "Autoritní"         -0.36
    "ViewInit"          -0.36
    " overzicht"        -0.35
    " виправивши"       -0.35
    " scales"           -0.34
    " stories"          -0.34
    "Previews"          -0.34
    "HtmlAttribute"     -0.34
    Positive Logits
    " uttered"     1.55
    " uttering"    1.13
    " utterance"   1.09
    " spoken"      1.07
    "uttered"      1.01
    " pronunci"    1.00
    " muttered"    1.00
    "spoken"       0.96
    "Spoken"       0.92
    " Spoken"      0.92
    Act Density 0.605%

    No Known Activations