INDEX
    Explanations

    Negative Logits
    " crappy"           2.17
    " shitty"           2.08
    " bullshit"         2.06
    " kinda"            2.02
    " messed"           1.93
    " mensen"           1.92
    " boobs"            1.90
    " haha"             1.82
    " kids"             1.80
    " yeah"             1.79

    Positive Logits
    " strikingly"       1.77
    "此外"                1.56
    " remarkably"       1.55
    " markedly"         1.54
    "Invoke"            1.51
    " crucially"        1.51
    " renowned"         1.50
    " unequivocal"      1.49
    " unquestionably"   1.48
    "regarded"          1.48
    Act Density 0.422%

    No Known Activations