INDEX
    Explanations
    Negative Logits
    " mitochond"    -0.07
    "interfaces"    -0.07
    "یات"    -0.07
    "وجود"    -0.07
    "kara"          -0.07
    " Oakland"      -0.07
    "ایط"    -0.07
    "ileceği"       -0.06
    " useClass"     -0.06
    "ونی"    -0.06
    Positive Logits
    " spam"         0.14
    " Spam"         0.12
    "spam"          0.08
    "trl"           0.07
    ".streaming"    0.06
    " prompted"     0.06
    " Dün"          0.06
    "im"            0.06
    " DEL"          0.06
    " strategic"    0.06
    Activation Density: 0.003%

    No Known Activations