INDEX
    Explanations

    Russian text

    Negative Logits
    Token          Logit
    "onse"         -0.08
    " octave"      -0.08
    "_medium"      -0.07
    " middels"     -0.07
    "(anim"        -0.07
    " contra"      -0.07
    "converter"    -0.07
    "orth"         -0.07
    "Wild"         -0.07
    "@All"         -0.07
    Positive Logits
    Token            Logit
    " GPT"           0.11
    "GPT"            0.09
    " البشر"         0.08
    " training"      0.08
    " supervising"   0.08
    "urator"         0.08
    " jailbreak"     0.08
    " تدريب"         0.08
    "训练"           0.08
    " humans"        0.07
    Act Density 0.012%

    No Known Activations