    Negative Logits
    Token                   Value
    " was"                  0.75
    "d"                     0.61
    (token not rendered)    0.61
    (token not rendered)    0.60
    "ו"                     0.59
    "ing"                   0.59
    (token not rendered)    0.58
    "i"                     0.58
    "e"                     0.57
    "ä"                     0.55
    Positive Logits
    Token                   Value
    "Until"                 0.73
    "1"                     0.73
    " Until"                0.65
    "until"                 0.64
    "September"             0.60
    (token not rendered)    0.59
    "Кла"                   0.58
    "Исто"                  0.57
    "cze"                   0.57
    "pengaruhi"             0.57
    Activation Density: 0.022%

    No Known Activations