INDEX
    Explanations

    mentions of different wars or war-related terms

    occurrences of the word "wars"

    Negative Logits
    ATURE      -0.65
    AUT        -0.63
    STER       -0.63
    SOURCE     -0.63
    YL         -0.60
    gow        -0.60
    Dialogue   -0.58
     Accuracy  -0.57
    Asset      -0.57
    nosis      -0.57
    Positive Logits
    hip        1.33
    hips       1.30
    pace       1.08
    poons      0.95
    uits       0.94
    hops       0.94
    pread      0.94
    cale       0.92
    mith       0.87
    pite       0.85
    Activation Density: 0.029%

    No Known Activations