    Explanations

    references to sources and citations in text

    Negative Logits
    Token           Logit
    "IGO"           -0.17
    ".dk"           -0.15
    "ropri"         -0.14
    "ITO"           -0.14
    "oro"           -0.14
    "arella"        -0.14
    "_relu"         -0.13
    ".bc"           -0.13
    "igo"           -0.13
    "implements"    -0.13
    Positive Logits
    Token           Logit
    " ÑĥÑģ"         0.16
    " shar"         0.16
    "sem"           0.15
    "ernel"         0.15
    "RED"           0.14
    " Bir"          0.14
    " ar"           0.14
    "onz"           0.14
    "anal"          0.14
    "ktor"          0.14
    Act Density 0.083%

    No Known Activations