INDEX
    Explanations
    Negative Logits
    " accompanies"    -0.79
    " livest"         -0.69
    " constitu"       -0.68
    " advertised"     -0.66
    " agre"           -0.64
    "idding"          -0.61
    "nesota"          -0.59
    "nar"             -0.59
    " coerc"          -0.58
    " dissatisf"      -0.57
    Positive Logits
    "aways"            1.26
    " advantage"       1.10
    "away"             0.94
    " heed"            0.93
    "uchi"             0.91
    " aback"           0.89
    " care"            0.84
    "overs"            0.82
    "prising"          0.80
    "frey"             0.75
    Activation Density: 0.042%

    No Known Activations