    Negative Logits
        " Injection"            -0.07
        "خدم"                   -0.06
        "-human"                -0.06
        " Files"                -0.06
        " Kids"                 -0.06
        "Holy"                  -0.06
        "NAL"                   -0.06
        [token not rendered]    -0.06
        [token not rendered]    -0.06
        "79"                    -0.06
    Positive Logits
        " stumbling"            0.07
        " Establish"            0.06
        " فارسی"                0.06
        " maze"                 0.06
        " startled"             0.06
        " bacheca"              0.06
        "Ao"                    0.06
        "Fuck"                  0.06
        " artış"                0.06
        "леж"                   0.06
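The two lists rank vocabulary tokens by how strongly this feature lowers or raises their logits. The page does not say how these scores are computed. A common approach, assumed here and not confirmed by the source, is to take the dot product of the feature's decoder direction with each column of the model's unembedding matrix, then keep the k lowest and k highest results. The sketch below is pure Python. The functions `logit_effects` and `top_logits` and the toy inputs are all hypothetical.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_dir: Sequence[float], unembed: Sequence[Sequence[float]]
) -> List[float]:
    """Dot a (hypothetical) feature decoder direction with every unembedding column.

    `unembed` is indexed as unembed[i][t] for residual dimension i and vocab token t.
    """
    if len(decoder_dir) != len(unembed):
        raise ValueError("decoder_dir and unembed must share the residual dimension")
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_logits(
    effects: Sequence[float], tokens: Sequence[str], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, effect) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest effects first
    positive = ranked[::-1][:k]    # highest effects first
    return negative, positive


# Toy example: 2-dimensional residual stream, 4-token vocabulary.
effects = logit_effects([1.0, 0.0], [[0.07, -0.07, 0.0, -0.06], [5.0, 5.0, 5.0, 5.0]])
negative, positive = top_logits(effects, [" maze", " Injection", "Holy", " Kids"], k=2)
print(negative)  # [(' Injection', -0.07), (' Kids', -0.06)]
print(positive)  # [(' maze', 0.07), ('Holy', 0.0)]
```

The token strings keep their leading spaces because many tokenizers treat " maze" and "maze" as distinct tokens. The dashboard values appear to be rounded to two decimals, which is why many entries share the same score.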
    Activation density: 0.054%
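The activation density figure is given without a definition. Assume it means the fraction of token positions where the feature's activation is above zero. This is a common convention for sparse-autoencoder features, but it is an assumption here. Under that reading, 0.054% is roughly 54 firing positions per 100,000 tokens. This is a minimal sketch; `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions where the feature's activation exceeds `threshold`."""
    if not activations:
        raise ValueError("activations must be non-empty")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# 54 firing positions out of 100,000 tokens matches the 0.054% density.
acts = [0.0] * 99_946 + [1.3] * 54
density = activation_density(acts)
print(f"{density:.3%}")  # 0.054%
```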

    No Known Activations