INDEX
    Explanations

    affirmative statements emphasizing certainty or agreement

    Negative Logits
    stad          -0.14
    gens          -0.14
    friend        -0.14
    EG            -0.14
    .Encoding     -0.13
    بÙĪØ±          -0.13
    rak           -0.13
    hyth          -0.13
    زار           -0.13
    illa          -0.13
    Positive Logits
    ernet         0.16
     indeed       0.16
    uche          0.15
    rahim         0.15
    ordo          0.15
    versation     0.15
    etti          0.15
    ÛĮات          0.14
    éĻħ           0.14
    Stride        0.14
    Activation Density: 0.010%

    No Known Activations