INDEX
    Explanations

    phrases related to causality and explanation

    expressions indicating uncertainty or speculation

    Negative Logits
    Token             Logit
    "ukong"           -0.74
    "yna"             -0.68
    " Goat"           -0.63
    "uggle"           -0.62
    "mop"             -0.61
    " Ping"           -0.61
    " Oro"            -0.60
    "unks"            -0.59
    "kie"             -0.59
    "76561"           -0.59
    Positive Logits
    Token             Logit
    " nevertheless"   1.72
    " nonetheless"    1.58
    "etheless"        1.18
    "still"           0.93
    " still"          0.93
    "theless"         0.83
    " remains"        0.74
    " undeniably"     0.73
    " retained"       0.71
    " ])"             0.69
    Activation Density: 0.300%

    No Known Activations