    Explanations

    legitimate

    Negative Logits
     ,              -0.68
     hâte           -0.68
     ;"             -0.64
    ]--;            -0.64
     engraçadas     -0.63
     Forumite       -0.60
     tiroirs        -0.60
     larmes         -0.60
     tiegħ          -0.59
     luffy          -0.59
    Positive Logits
    y               0.69
     the            0.68
    i               0.66
     they           0.63
     it             0.59
     and            0.59
     a              0.58
     with           0.57
    war             0.56
     is             0.55
    Activation Density: 0.798%

    No Known Activations