    NEGATIVE LOGITS
    uate           -0.74
    ãĥ¼ãĥĨãĤ£      -0.73
    imental        -0.72
    ï¸             -0.68
     Werewolf      -0.68
    LESS           -0.67
    uated          -0.65
    iating         -0.65
    UAL            -0.64
     assistants    -0.64

    POSITIVE LOGITS
    oks             1.20
    oked            1.16
    chet            1.04
    ppo             1.03
    opa             1.00
    pper            0.99
    pped            0.97
    tch             0.95
    ppers           0.94
    bones           0.94
    Activation Density: 0.048%

    No Known Activations