INDEX
    Explanations

    text related to harmful, intentional actions

    terms related to malicious intent or harmful actions

    Negative Logits
    Token       Logit
    "ĸļ"        -1.26
    "arist"     -0.86
    "ills"      -0.81
    "akeru"     -0.76
    "gdala"     -0.73
    "ļéĨĴ"      -0.72
    "illy"      -0.71
    "hene"      -0.71
    "ère"       -0.69
    "blance"    -0.69
    Positive Logits
    Token           Logit
    "ly"            1.23
    " intent"       1.06
    " mischief"     0.87
    " behaviour"    0.87
    " activity"     0.84
    " behavi"       0.82
    " payload"      0.82
    "icious"        0.82
    "LY"            0.81
    " malicious"    0.80
    Activation Density: 0.021%
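
    As a rough illustration of what a density figure like this could mean, the sketch below assumes that activation density is the fraction of sampled token positions on which the feature's activation exceeds zero. That definition, the function name `activation_density`, and the sample data are assumptions for illustration, not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumed definition: the page does not state how its density is computed.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 21 active positions out of 100,000 tokens.
sample = [0.0] * 99_979 + [1.5] * 21
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.021%"
```

    Under this assumed definition, a density of 0.021% would correspond to the feature firing on about 21 of every 100,000 tokens.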

    No Known Activations