    Explanations

    mentions of failure or underperformance

    instances of the word "failing" or its variations

    Negative Logits
    Token        Logit
    "atar"       -0.74
    "auts"       -0.67
    "entle"      -0.65
    "eous"       -0.64
    "abb"        -0.63
    "Works"      -0.63
    " arom"      -0.62
    "atri"       -0.62
    " Hyd"       -0.60
    "arf"        -0.59
    Positive Logits
    Token        Logit
    " failing"    3.62
    " failure"    2.08
    "failed"      1.83
    " fail"       1.83
    " failed"     1.81
    "Failure"     1.79
    " failures"   1.78
    " Failure"    1.71
    "fail"        1.69
    " fails"      1.66
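    Lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and reading off the highest- and lowest-scoring tokens. The sketch below assumes exactly that: a decoder vector of shape (d_model,), an unembedding matrix W_U of shape (d_model, vocab_size), and no extra normalization. The function name, the toy vocabulary, and the matrix values are hypothetical, and the dashboard's actual pipeline may differ.

```python
import numpy as np

def top_logit_tokens(decoder_vec, W_U, vocab, k=10):
    """Return the k most positive and k most negative (token, logit) pairs.

    Assumption: decoder_vec has shape (d_model,) and W_U has shape
    (d_model, vocab_size); logits are the raw projection, unnormalized.
    """
    logits = decoder_vec @ W_U          # direct effect on each vocab logit
    order = np.argsort(logits)          # indices sorted from lowest to highest
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative

# Hypothetical toy setup: d_model = 2, vocabulary of 4 tokens.
vocab = [" failing", " failure", "atar", "auts"]
W_U = np.array([[3.0, 2.0, -1.0, -0.5],
                [0.5, 0.1, 0.2, -0.1]])
decoder_vec = np.array([1.0, 0.5])

positive, negative = top_logit_tokens(decoder_vec, W_U, vocab, k=2)
print(positive)
print(negative)
```

    Leading spaces are kept in the token strings because tokenizers usually treat " failing" and "failing" as distinct tokens, which is why both variants appear separately in the lists.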
    Activation density: 0.015%
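    Activation density is typically the share of token positions on which the feature fires. The sketch below assumes "fires" means an activation strictly above zero; the threshold parameter and the toy data are hypothetical, not taken from the dashboard.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`."""
    activations = np.asarray(activations)
    return 100.0 * np.count_nonzero(activations > threshold) / activations.size

# Hypothetical data: the feature fires on 3 of 20,000 token positions.
acts = np.zeros(20_000)
acts[:3] = [0.8, 1.2, 2.5]
print(f"{activation_density(acts):.3f}%")  # → 0.015%
```

    A density this low means the feature is very sparse, which fits its narrow, word-specific explanations.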

    No Known Activations