    Explanations

    names, initials, or abbreviations

    Negative Logits
    -       0.81
    T       0.66
    N       0.66
    1       0.63
    3       0.62
    S       0.59
    V       0.59
    2       0.58
     -      0.58
    /       0.57
    Positive Logits
    .—      0.61
    .–      0.59
    ।,      0.59
    .~\     0.56
    .;      0.55
    .".,    0.55
    .,      0.54
    .{      0.54
    .--     0.54
    .,"     0.53
    Act Density 0.011%

    No Known Activations