    Negative Logits
    Token               Logit
    " pleaſure"         -0.91
    " للمعارف"          -0.91
    " purpoſe"          -0.83
    " whoſe"            -0.79
    " greateſt"         -0.77
    " poffe"            -0.76
    " Efq"              -0.76
    " myſelf"           -0.75
    "Personensuche"     -0.75
    " fevere"           -0.73
    Positive Logits
    Token               Logit
    "↵↵"                0.57
    "<bos>"             0.54
    " $\"               0.43
    "enumi"             0.43
    " De"               0.41
    "Â"                 0.41
    " acidente"         0.40
                        0.40
    " -"                0.40
    " the"              0.39
    Activation Density: 0.035%

    No Known Activations