INDEX
    Explanations

    mentions of potential threats or dangers in various contexts

    Negative Logits
    Token       Logit
    "iao"       -0.22
    "oya"       -0.19
    "ocker"     -0.16
    "icum"      -0.15
    "haul"      -0.15
    "estroy"    -0.15
    " ple"      -0.15
    " Gover"    -0.14
    "ureka"     -0.14
    "ocket"     -0.14
    Positive Logits
    Token           Logit
    " posed"        0.28
    " Pos"          0.23
    "ening"         0.22
    "ened"          0.22
    "posed"         0.19
    " hung"         0.19
    " hanging"      0.19
    "威"            0.18
    "rical"         0.18
    " perception"   0.18
    Activation Density: 0.036%

    No Known Activations