    Negative Logits
    "¢"       -1.66
    "¸"       -1.59
    "isor"    -1.46
    "ior"     -1.45
    "head"    -1.44
    "sted"    -1.44
    "¡"       -1.44
    "uchi"    -1.42
              -1.41
    "ized"    -1.38
    Positive Logits
    " outright"     1.70
    " its"          1.63
    " other"        1.63
    " vice"         1.59
    " sexes"        1.56
    " others"       1.55
    " infinity"     1.53
    "gger"          1.48
    " slightest"    1.43
    " anything"     1.43
    Activation Density: 0.354%

    No Known Activations