INDEX
    Explanations

    phrases that reflect personal introspection and self-reflection

    Negative Logits
    Token         Logit
    "ez"          -0.16
    "rava"        -0.16
    "imo"         -0.15
    "fern"        -0.15
    " bargain"    -0.14
    "asers"       -0.14
    ".mit"        -0.14
    "indsight"    -0.14
    "äge"         -0.13
    "aná"         -0.13
    Positive Logits
    Token               Logit
    " ways"             0.26
    " Ways"             0.18
    "象"                0.17
    " how"              0.17
    " differently"      0.16
    " ramifications"    0.15
    "ering"             0.15
    "owitz"             0.14
    "ulus"              0.14
    " worst"            0.14
    Activation Density: 0.068%

    No Known Activations