INDEX
    Explanations

    phrases about potential outcomes or capabilities

    Negative Logits
     somehow     -0.19
    untime       -0.15
    žÃŃ          -0.15
    aron         -0.15
    Probably     -0.15
    à¸Īะ         -0.14
    egl          -0.14
    irs          -0.14
    pery         -0.14
    .overlay     -0.14
    Positive Logits
     sometimes    0.34
    sometimes     0.28
     Sometimes    0.26
     be           0.24
    Sometimes     0.24
    ometimes      0.24
     often        0.23
     range        0.22
    often         0.21
     oft          0.20
    Activation Density: 0.156%

    No Known Activations