INDEX
    Explanations

    terms related to harmful intent or behavior

    terms related to malicious behavior or intent

    Negative Logits
    "ĸļ"          -0.87
    "arist"       -0.85
    "ILA"         -0.76
    "ills"        -0.74
    "hetti"       -0.70
    " Passage"    -0.70
    "alon"        -0.69
    " Gap"        -0.69
    "quart"       -0.69
    " Prayer"     -0.67

    Positive Logits
    " malicious"   1.18
    " mischief"    0.96
    " intent"      0.94
    " payload"     0.87
    "icious"       0.86
    "ly"           0.82
    "vertising"    0.78
    " behavi"      0.78
    " behaviour"   0.77
    "fully"        0.76

    Act Density 0.005%

    No Known Activations