INDEX
    Explanations

    phrases indicating attribution or assigning responsibility

    phrases that indicate causation or blame

    Negative Logits
    Token             Logit
    Dur               -0.82
    esson             -0.67
     doesnt           -0.62
     Advertisement    -0.61
    ById              -0.60
     Means            -0.60
     nets             -0.59
     didnt            -0.58
     belts            -0.57
    SourceFile        -0.57

    Positive Logits
    Token             Logit
     blame            1.55
    asted             1.43
    asting            1.34
    ying              1.27
    wered             1.25
    lled              1.24
    iling             1.16
    pless             1.13
    gg                1.13
    iled              1.10
    Activation Density: 0.081%

    No Known Activations