INDEX
    Explanations

    discussions of threats and their implications for safety

    Negative Logits
    ovsky      -0.16
    égor       -0.15
    argas      -0.15
    565        -0.14
     Victims   -0.14
    532        -0.14
    alez       -0.14
    artin      -0.14
    otos       -0.14
     tmpl      -0.14
    Positive Logits
     threat        0.59
     threats       0.52
    threat         0.50
     Threat        0.49
    威             0.49
    -threat        0.49
    Th             0.48
     danger        0.42
     threatening   0.40
     TH            0.40
    Act Density 0.157%

    No Known Activations