    Explanations

    causal language modeling

    Negative Logits
    Token             Value
    "pu"              0.40
    " calculated"     0.38
    "imated"          0.38
    "calculated"      0.38
    "bure"            0.38
    "alone"           0.38
    "calcul"          0.37
    "bala"            0.36
    "readers"         0.36
    "foot"            0.36
    Positive Logits
    Token             Value
    (blank)           0.43
    "тков"            0.39
    " дома"           0.37
    "тел"             0.37
    " gemak"          0.37
    " Inu"            0.36
    " Lam"            0.36
    " emocional"      0.36
    "HeaderParams"    0.36
    ";|"              0.36
    Activation Density: 0.000%

    No Known Activations