INDEX
    Explanations
    NEGATIVE LOGITS
    " adapted"         0.76
    " ironically"      0.76
    " U"               0.75
    " selected"        0.75
    " scattered"       0.74
    " some"            0.74
    " surprisingly"    0.74
    " someone"         0.73
    " crossed"         0.72
    " Remarks"         0.72
    POSITIVE LOGITS
    ":]"               1.35
    ":]:"              1.18
    ":-"               1.17
    "::"               1.09
                       1.08
    ":,"               1.05
    ":])"              1.04
    " :-"              1.00
    " onwards"         0.96
    ":)"               0.95
    Activation Density: 0.012%

    No Known Activations