INDEX
    Explanations

    assertive statements or judgments

    phrases that express safety and certainty

    Negative Logits
    Token               Logit
    "tons"              -0.71
    "ĸļ"                -0.64
    "listed"            -0.63
    " Saving"           -0.63
    " mattered"         -0.62
    " millenn"          -0.58
    "ories"             -0.57
    " objectionable"    -0.57
    " IMAGES"           -0.57
    "pieces"            -0.57
    Positive Logits
    Token               Logit
    " assume"           1.39
    " conclude"         1.24
    " speculate"        1.15
    " say"              1.07
    " presume"          1.05
    " expect"           0.99
    " suggest"          0.99
    " criticize"        0.98
    " argue"            0.98
    " ask"              0.94
    Activation Density: 0.071%

    No Known Activations