INDEX
    Explanations

    predict how humans rate

    Negative Logits
    Token               Value
    "HALF"              0.45
    "َع"                0.43
    "canceled"          0.43
    "ومي"               0.42
    "реди"              0.41
    "̝"                 0.41
    "half"              0.40
    (no token text)     0.40
    " upset"            0.40
    "pyridin"           0.40
    Positive Logits
    Token               Value
    " PDA"              0.47
    "ərd"               0.46
    " Bibliography"     0.44
    " Stove"            0.44
    "ský"               0.43
    " سایر"             0.43
    " Materials"        0.43
    "ኛውም"               0.43
    " CTA"              0.43
    " Gebä"             0.43
    Activation Density: 0.003%
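For scale, a density figure like this can be read as the share of sampled tokens on which the feature fires at all. The sketch below assumes that reading. The function `activation_density`, the zero threshold, and the sample data are all hypothetical, and this is not the dashboard's own computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means active tokens / total tokens sampled.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, of which 3 have a nonzero activation.
sample = [0.0] * 99_997 + [1.2, 0.7, 2.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.003%
```

Under this assumption, 0.003% corresponds to roughly 3 active tokens per 100,000 sampled.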

    No Known Activations