INDEX
    Explanations
    Negative Logits
    S             -1.36
     myſelf       -1.30
    ſelf          -1.26
     itſelf       -1.23
    T             -1.18
     Efq          -1.16
    C             -1.16
    P             -1.13
     themſelves   -1.13
     pleaſure     -1.12
    Positive Logits
                  0.60
    the           0.54
    '             0.53
    -             0.51
    .             0.51
    ,             0.51
    ↵↵            0.50
     (            0.49
    /             0.48
    _             0.47
    Activation Density: 0.178%

    No Known Activations