INDEX
    Explanations

    phrases expressing opinions or judgements

    phrases that indicate making a statement or providing examples

    Negative Logits
    Token           Logit
    "xual"          -0.63
    "iru"           -0.63
    "uzz"           -0.62
    " Completed"    -0.62
    "elsen"         -0.61
    "arnaev"        -0.60
    "fram"          -0.60
    " sidx"         -0.58
    "resa"          -0.58
    "enegger"       -0.57
    Positive Logits
    Token           Logit
    "laughs"        0.73
    " nothing"      0.71
    " least"        0.67
    "hem"           0.62
    "Credit"        0.57
    "_>"            0.57
    "pecially"      0.56
    " detract"      0.56
    " suffice"      0.56
    "gap"           0.55
    Activation Density: 0.087%

    No Known Activations