    Explanations

    terms related to responsibility and transparency in various contexts

    Negative Logits
    Token          Logit
    "lingen"       -0.15
    "aat"          -0.15
    "ongan"        -0.15
    "veau"         -0.14
    ".Encoding"    -0.14
    "ikt"          -0.14
    "oin"          -0.14
    "Unlock"       -0.14
    "ilo"          -0.14
    "言葉"         -0.14
    Positive Logits
    Token          Logit
    " tom"         0.15
    "afil"         0.14
    "ADE"          0.13
    "chia"         0.13
    "respons"      0.13
    "iez"          0.13
    "atest"        0.13
    "utions"       0.13
    "arse"         0.13
    " Gang"        0.13
    Activation Density: 0.008%

    No Known Activations