    Negative Logits
     pleaſure     -0.96
    ^(@)          -0.96
    +#+#          -0.93
     greateſt     -0.86
    #+#           -0.82
     Efq          -0.82
     ſeveral      -0.81
     itſelf       -0.79
     houſe        -0.77
    providedIn    -0.76
    Positive Logits
    <bos>          1.44
    '              0.48
    jan            0.47
     the           0.46
     try           0.45
     di            0.43
    ΙΑ             0.42
    mopol          0.42
     Letter        0.42
     plan          0.42
    Activation Density: 0.879%

    No Known Activations