    Explanations

    lists numbers after punctuation

    Negative Logits
        "is"        0.56
        "п"         0.55
        "But"       0.55
        "стю"       0.50
        "ре"        0.50
        "נת"        0.49
        "пу"        0.48
        "ار"        0.48
        "However"   0.48
                    0.47
    Positive Logits
        "VILLE"           0.63
        "<unused2156>"    0.63
        "<unused2152>"    0.63
        " يج"             0.63
        "<unused193>"     0.63
        " JOHN"           0.62
        " showcased"      0.61
        "<unused2033>"    0.61
        "بھی"             0.61
        " unveiled"       0.61
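The positive and negative logit lists rank vocabulary tokens by how strongly this feature pushes their output logits up or down. A common way to produce such lists, assumed here rather than confirmed by the source, is to take the dot product of the feature's decoder direction with each token's unembedding vector and sort the results. The sketch below is a toy, pure-Python version; decoder_dir, unembed, and vocab are hypothetical inputs, and unembed is assumed to be laid out as d_model rows by vocabulary-size columns.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  unembed: Sequence[Sequence[float]],
                  vocab: Sequence[str]) -> List[Tuple[str, float]]:
    """Score each token by dotting its unembedding column with the decoder direction.

    Assumption (hypothetical layout): unembed[i][j] is row i of d_model,
    column j of the vocabulary, so column j is token j's unembedding vector.
    """
    d_model = len(decoder_dir)
    scores = []
    for j, token in enumerate(vocab):
        s = sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        scores.append((token, s))
    return scores


def top_logits(scores: List[Tuple[str, float]], k: int,
               positive: bool = True) -> List[Tuple[str, float]]:
    """Return the k highest-scoring tokens (positive=True) or the k lowest (positive=False)."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=positive)
    return ranked[:k]


if __name__ == "__main__":
    # Toy 2-dimensional model with a 3-token vocabulary (made-up numbers).
    scores = logit_effects([1.0, 0.0],
                           [[0.5, -0.2, 0.9],
                            [0.1, 0.3, -0.4]],
                           ["a", "b", "c"])
    print("positive:", top_logits(scores, 2))
    print("negative:", top_logits(scores, 1, positive=False))
```

Sorting the full score list is fine for a toy vocabulary; for a real vocabulary of hundreds of thousands of tokens, a partial selection such as heapq.nlargest would avoid sorting everything.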
    Activation Density: 1.307%
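Activation density is the percentage of tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above a threshold (0.0 here) and that activations arrive as a hypothetical list of per-sequence lists; the dashboard's actual threshold and token sample are not stated in the source.

```python
from typing import Iterable, Sequence


def activation_density(activations: Iterable[Sequence[float]],
                       threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    total = 0
    active = 0
    for seq in activations:
        for a in seq:
            total += 1
            if a > threshold:
                active += 1
    # Guard against an empty sample rather than dividing by zero.
    return 100.0 * active / total if total else 0.0


if __name__ == "__main__":
    # Two toy sequences: the feature fires on 2 of 8 tokens.
    print(f"{activation_density([[0.0, 1.2, 0.0], [0.0, 0.0, 3.4, 0.0, 0.0]]):.3f}%")
```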

    No Known Activations