INDEX
    Explanations

    goal use meaning protected

    Negative Logits
     […]            1.05
    <unused2223>    0.99
    <unused2197>    0.97
    <unused2206>    0.95
    […]             0.94
    <unused345>     0.92
     (…)            0.91
    <unused2169>    0.86
    ),[             0.86
    ).[             0.84
    Positive Logits
     ihe    1.66
     o      1.61
     u      1.55
     t      1.53
     lhe    1.53
     tn     1.50
     ao     1.49
     nt     1.49
     r      1.45
     ia     1.45
    Act Density 0.017%

    No Known Activations