INDEX
    Explanations

    phrases related to various actions and their consequences

    words and phrases related to damage or consequences

    Negative Logits
    Token      Logit
    "anus"     -0.60
    "laus"     -0.56
    "ª"        -0.55
    "cknow"    -0.53
    "giene"    -0.52
    "OPLE"     -0.51
    "Ħ¢"       -0.50
    "rek"      -0.49
    "TAIN"     -0.49
    "ZI"       -0.49
    Positive Logits
    Token                  Logit
    " differently"         0.79
    " nicely"              0.65
    " everywhere"          0.62
    " lin"                 0.62
    " indistinguishable"   0.58
    " whereas"             0.58
    " anyways"             0.57
    " beautifully"         0.57
    " automatically"       0.57
    " MUCH"                0.56
    Activation Density: 1.195%
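
As a rough illustration of how a density figure like the one above could be read, the sketch below assumes density means the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the toy data are all assumptions made for illustration; none of them comes from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" at a position when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy sample: 1000 token positions, 12 of them active.
sample = [0.0] * 988 + [0.5] * 12
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 1.195% would mean the feature is active on roughly 12 of every 1000 sampled tokens.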

    No Known Activations