    Explanations

    phrases related to policy and government actions

    phrases related to safety and well-being

    Negative Logits
    Token          Logit
    " confir"      -0.70
    " ende"        -0.70
    " misunder"    -0.65
    " dismant"     -0.62
    " destro"      -0.62
    " Learns"      -0.59
    "OSP"          -0.59
    "ather"        -0.59
    " Rampage"     -0.58
    " reluct"      -0.58

    Positive Logits
    Token          Logit
    "ibel"         0.63
    "into"         0.62
    " innocent"    0.61
    "oths"         0.58
    " Inn"         0.58
    "evil"         0.56
    "rum"          0.56
    " Narr"        0.56
    "rily"         0.55
    "metadata"     0.55

    Act Density 0.767%

    No Known Activations