    Explanations

    words related to risks or dangers

    references to perceived dangers or risks

    NEGATIVE LOGITS
    "puted"   -0.81
    "mys"     -0.78
    "raham"   -0.74
    "bits"    -0.74
    "ilts"    -0.71
    "arist"   -0.71
    "ria"     -0.69
    "gans"    -0.68
    "unes"    -0.68
    "ANN"     -0.67
    POSITIVE LOGITS
    " threat"      1.13
    " threats"     0.98
    " Threat"      0.93
    " proble"      0.92
    "threat"       0.89
    " posed"       0.86
    " deterrent"   0.85
    " menace"      0.84
    " challeng"    0.84
    " undermin"    0.83
    Act Density 0.017%

    No Known Activations