INDEX
    Explanations

    the word "won't" with high activation values

    negations or words indicating refusal or denial

    Negative Logits
        " behavi"         -0.77
        "CVE"             -0.72
        " Reloaded"       -0.72
        "Reviewer"        -0.72
        " examiner"       -0.70
        " Palestin"       -0.70
        "HTTP"            -0.67
        " Moroc"          -0.67
        "Hardware"        -0.66
        " ventilation"    -0.65
    Positive Logits
        "weet"            1.01
        "ardless"         0.91
        "acular"          0.91
        "ruck"            0.91
        "urtle"           0.91
        "ravis"           0.86
        "aylor"           0.81
        "otally"          0.80
        "itles"           0.79
        "rees"            0.79
    Activation Density: 0.032%

    No Known Activations