    Explanations

    Phrases related to making assumptions or holding beliefs about others' intentions

    Negative Logits
    Token       Logit
    "erman"     -0.18
    "esian"     -0.18
    "essler"    -0.17
    "rome"      -0.16
    "etting"    -0.16
    "udas"      -0.15
    "ة"         -0.15
    "osl"       -0.15
    "Assert"    -0.15
    "ness"      -0.15
    Positive Logits
    Token       Logit
    "ably"      0.23
    "/assert"   0.22
    "ptions"    0.19
    " Worst"    0.17
    "nal"       0.17
    "PTION"     0.16
    "ptive"     0.16
    "267"       0.16
    "ively"     0.16
    " worst"    0.16
    Activation Density: 0.026%

    No Known Activations