INDEX
    Explanations
    Negative Logits
    "dorf"         -0.08
    "ิดข"          -0.07
    "ork"          -0.07
    " Moran"       -0.07
    "laughter"     -0.06
    " Dirt"        -0.06
    " crew"        -0.06
    " todd"        -0.06
    " Friedman"    -0.06
    "lick"         -0.06
    Positive Logits
    " Se"          0.19
    "Se"           0.17
    " se"          0.17
    " SE"          0.16
    "SE"           0.16
    "se"           0.16
    "-se"          0.15
    "-Se"          0.14
    "/se"          0.14
    "_se"          0.12
    Activation Density: 0.027%

    No Known Activations