INDEX
    Explanations

    sentences discussing individuals and their experiences or roles

    Negative Logits
    Token          Logit
    "less"         -0.35
    "ılığı"        -0.34
    "IAL"          -0.33
    " uygun"       -0.33
    " no"          -0.31
    " وقد"         -0.30
    "ノリ"          -0.30
    " some"        -0.29
    "üng"          -0.29
    " did"         -0.29
    Positive Logits
    Token          Logit
    " every"       0.82
    "每一次"        0.80
    "every"        0.80
    " Chwiliwch"   0.79
    "Every"        0.79
    " ſte"         0.77
    " Every"       0.76
    " Anſ"         0.73
    " Мексичка"    0.73
    " każ"         0.71
    Activation Density: 0.458%

    No Known Activations