INDEX
    Explanations

    phrases that compare and contrast positive and negative aspects

    phrases indicating a contrast between positive and negative aspects

    NEGATIVE LOGITS
    Token           Logit
    "vernment"      -0.82
    "sbm"           -0.78
    "ICLE"          -0.75
    "20439"         -0.74
    "quit"          -0.69
    " Intake"       -0.69
    "ebin"          -0.67
    "ciating"       -0.67
    "sections"      -0.66
    "igraph"        -0.66

    POSITIVE LOGITS
    Token           Logit
    " evil"         1.01
    " brightest"    0.99
    "evil"          0.99
    " cheerful"     0.96
    " shiny"        0.96
    " noble"        0.96
    " honorable"    0.94
    " fluffy"       0.94
    " virtuous"     0.91
    " tidy"         0.91

    Activation Density: 0.110%

    No Known Activations