    Explanations

    concepts related to logic and reasoning

    Negative Logits
    "a"         -0.15
    "aze"       -0.15
    "iod"       -0.15
    "idon"      -0.15
    " ke"       -0.14
    "p"         -0.14
    " Posts"    -0.14
    "aris"      -0.14
    "ully"      -0.14
    "ìĸij"       -0.14
    Positive Logits
    "apl"       0.16
    "eon"       0.15
    "ÑĤÑĶ"      0.15
    "ĵåIJį"      0.15
    "ERY"       0.15
    "akash"     0.15
    " Pointer"  0.14
    "ity"       0.14
    "ITY"       0.14
    "stype"     0.14
    Activation Density: 0.035%

    No Known Activations