INDEX
    Explanations

    mentions of pairs or groups of items or concepts

    Negative Logits
    " Various"     -0.45
    " various"     -0.43
    " all"         -0.42
    "各种"         -0.39
    "Various"      -0.38
    " pelbagai"    -0.38
    " berbagai"    -0.37
    " variés"      -0.37
    "various"      -0.36
    " tất"         -0.36
    Positive Logits
    "two"             0.78
    " two"            0.77
    " deux"           0.77
    "兩種"            0.76
    "兩個"            0.73
    "Two"             0.71
    " ujednoznacz"    0.71
    " zwei"           0.71
    "两个"            0.71
    " dvě"            0.68
    Act Density 0.898%

    No Known Activations