INDEX
    Explanations

    phrases indicating seriousness or significant concern

    Negative Logits
    Token      Logit
    "ugs"      -0.17
    "inely"    -0.16
    "olars"    -0.16
    "oca"      -0.15
    "ERRU"     -0.15
    "abh"      -0.15
    "esa"      -0.14
    "ále"      -0.14
    "avour"    -0.14
    "ÏĮ"       -0.14
    Positive Logits
    Token             Logit
    " likelihood"     0.31
    " probability"    0.29
    " honesty"        0.25
    " intents"        0.23
    " practical"      0.23
    " honestly"       0.23
    " cand"           0.22
    " fairness"       0.22
    " odds"           0.21
    " accounts"       0.21
    Activation Density: 0.036%

    No Known Activations