    Negative Logits
    Token          Logit
    " hate"        -1.25
    " hates"       -1.16
    "hate"         -1.09
    " hating"      -1.05
    " Hate"        -1.02
    "Hate"         -1.02
    " hated"       -1.00
    " HATE"        -0.94
    " dislikes"    -0.93
    " dislike"     -0.93
    Positive Logits
    Token             Logit
    " ویکی‌پدیا"       0.66
    " Roskov"         0.63
    "Története"       0.59
    " وتسجيلات"       0.57
    " termica"        0.56
    "脚注の使い方"      0.55
    " sagesse"        0.54
    "AsUp"            0.54
    " nahilalakip"    0.54
    " numéros"        0.54
    Activation Density: 0.014%

    No Known Activations