INDEX
    Explanations

    disallowed content

    Negative Logits
    Token       Logit
    -demand     -0.09
    Cru         -0.09
    Demand      -0.09
     amado      -0.08
    Crypt       -0.08
    Inet        -0.08
     Demand     -0.08
    -growing    -0.08
     Inet       -0.08
     amad       -0.08
    Positive Logits
    Token       Logit
     jail       0.09
    は禁止      0.08
     जेल        0.08
     toxic      0.08
     a          0.08
     jailbreak  0.08
     psycho     0.08
                0.08
     терап      0.08
     Jail       0.08
    Act Density 0.939%

    No Known Activations