INDEX
    Explanations

    phrases indicating the presence or inclusion of specific items or features

    Negative Logits
     же
    -0.18
    ieres
    -0.15
     případně
    -0.14
    arlo
    -0.14
    ･
    -0.13
    andr
    -0.13
    acos
    -0.13
    ع
    -0.13
     Others
    -0.12
    oris
    -0.12
    Positive Logits
     both
    0.28
     neither
    0.24
     mostly
    0.23
     собой
    0.23
     mainly
    0.23
     nothing
    0.23
     elements
    0.22
     only
    0.22
     everything
    0.22
     plenty
    0.21
    Act Density 0.291%

    No Known Activations