    NEGATIVE LOGITS
     Anywhere         -0.08
     digno            -0.08
     accompl          -0.08
     indispensable    -0.08
                      -0.08
     kwalitat         -0.08
     Fabulous         -0.08
     在               -0.08
     surveyed         -0.07
     compon           -0.07
    POSITIVE LOGITS
     malicious        0.11
     spam             0.11
     Spam             0.10
     jailbreak        0.09
     malware          0.09
     phishing         0.09
     trick            0.09
     GPT              0.09
    Spam              0.08
    作弊              0.08
    Act Density 0.014%

    No Known Activations