    Negative Logits
    Logit  Token
    0.77   " are"
    0.62   " was"
    0.58   "{"
    0.58   " not"
    0.56   " prejudiced"
    0.55   " severely"
    0.54   " inherently"
    0.53   " zowel"
    0.52   "haired"
    0.52   " supremely"
    Positive Logits
    Logit  Token
    0.77   "ת"
    0.68   "thema"
    0.64   "The"
    0.63   "ли"
    0.63   "נו"
    0.62   "n"
    0.60
    0.59
    0.56   "ла"
    0.56   "time"
    Activation Density: 0.016%

    No Known Activations