INDEX
    Explanations

    words related to restrictions or limitations

    Negative Logits
    Token     Logit
    "usi"     -0.18
    "andra"   -0.15
    "uh"      -0.15
    "vv"      -0.15
    "AWN"     -0.14
    "wick"    -0.14
    "isay"    -0.14
    "rij"     -0.14
    "ans"     -0.14
    "inz"     -0.14
    Positive Logits
    Token         Logit
    " buy"        0.19
    " bu"         0.18
    " bt"         0.17
    "byn"         0.16
    " byt"        0.15
    " byl"        0.15
    "ëį°ìĿ´íĬ¸"   0.15
    " bye"        0.14
    ".ali"        0.14
    "á»Ŀ"         0.14
    Activation Density: 0.111%

    No Known Activations