    Explanations
    Negative Logits
    Token        Logit
    "情况"        -0.15
    " exploit"   -0.14
    "reds"       -0.14
    "шло"        -0.14
    " charge"    -0.14
    " debt"      -0.13
    "(金"         -0.13
    "ged"        -0.13
    " poil"      -0.13
    "Ao"         -0.13
    Positive Logits
    Token        Logit
    "lesai"      0.17
    "atrice"     0.16
    "мор"        0.16
    "lope"       0.15
    "uta"        0.15
    "mate"       0.15
    "uat"        0.15
    "atee"       0.14
    "haul"       0.14
    "ieee"       0.14
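    The page does not say how these logit values are produced. One common approach for sparse-autoencoder features is assumed here and not confirmed by the source: score every vocabulary token by the dot product of the feature's decoder direction with that token's unembedding vector, then list the lowest and highest scores. The minimal sketch below uses plain Python lists. `logit_effects`, `decoder_row`, `unembed`, and `vocab` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_row: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and k most positive (token, score) pairs.

    Assumption (hypothetical): a token's score is the dot product of the
    feature's decoder direction (length d_model) with that token's
    unembedding column. `unembed` has d_model rows of vocab-sized lists.
    """
    d_model = len(decoder_row)
    scores = [
        sum(decoder_row[i] * unembed[i][t] for i in range(d_model))
        for t in range(len(vocab))
    ]
    # Sort ascending by score: the most negative tokens come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive
```

    With a toy two-dimensional model, `logit_effects([1.0, -1.0], [[0.5, 0.1, -0.2], [0.0, 0.3, 0.4]], ["a", "b", "c"], k=1)` ranks "c" as the most negative token and "a" as the most positive. Any scaling or bias the real dashboard applies is not modeled.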
    Activation Density 0.051%
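    The page does not define activation density. The sketch below assumes it is the share of sampled tokens on which the feature's activation is strictly positive. `activation_density` and its `threshold` parameter are hypothetical names, not part of any dashboard API.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires (activation > 0 by default). Returns 0.0 for no tokens.
    """
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    return active / total if total else 0.0
```

    Under this definition, the reported 0.051% corresponds to roughly 51 firings per 100,000 tokens, or about one token in 2,000.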

    No Known Activations