INDEX
    Explanations
NEGATIVE LOGITS
    Token           Logit
    " σε"           0.41
    "dır"           0.40
    " dotycz"       0.38
    "もら"          0.38
    " gebruiken"    0.36
    "ק"             0.36
    " THEY"         0.35
    " welke"        0.34
    " في"           0.34
    "اة"            0.34
POSITIVE LOGITS
    Token           Logit
    " to"           0.63
    "К"             0.50
    "СТ"            0.44
    " for"          0.43
    (token text missing in source)    0.43
    "↵↵"            0.42
    "یس"            0.41
    (token text missing in source)    0.41
    " It"           0.40
    "to"            0.40
    Activation Density: 0.001%

    No Known Activations