INDEX
    Explanations
    Negative Logits
    0            -0.08
    1            -0.08
                 -0.07
    5            -0.07
    man          -0.07
     With        -0.07
     compuls     -0.07
     with        -0.06
     trolling    -0.06
    ated         -0.06
    Positive Logits
    daf          0.08
    ,state       0.07
    ("-          0.07
    gam          0.07
                 0.07
    @\           0.07
    ičky         0.07
    ,ev          0.07
    jab          0.07
    yz           0.06
    Act Density 2.222%

    No Known Activations