    Explanations

    self-awareness and discovery

    Negative Logits
        Token             Value
        " aptly"          0.37
        " eponymous"      0.37
        "athon"           0.36
        " provved"        0.36
        " об"             0.35
        "ปลี่ยน"            0.35
        "学家"            0.35
        "Zobacz"          0.35
        "стом"            0.35
        " dossier"        0.34

    Positive Logits
        Token             Value
        " are"            0.44
        " jsou"           0.44
        " sono"           0.43
        " fémin"          0.43
        " není"           0.42
        " thay"           0.41
        " isn"            0.40
        " dific"          0.40
        " nejsou"         0.40
        " dificultades"   0.40
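    The page does not say how the two token lists are produced. Feature dashboards of this kind commonly rank vocabulary tokens by the dot product of the latent's decoder direction with each token's unembedding vector, so the following is a minimal pure-Python sketch under that assumption. The function names, toy vectors, and tokens are hypothetical. The page lists the negative-side values without a sign; this sketch keeps the sign.

```python
"""Sketch (assumed method): rank tokens by a latent's direct logit effect."""
from typing import Dict, List, Sequence, Tuple

Ranking = List[Tuple[str, float]]


def logit_effects(decoder_dir: Sequence[float],
                  unembed: Dict[str, Sequence[float]]) -> Dict[str, float]:
    # Dot product of the latent's decoder vector with each token's unembedding
    # vector: how much writing this direction raises or lowers that token's logit.
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}


def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[Ranking, Ranking]:
    # Sort ascending by effect, then take the k largest (most promoted)
    # and the k smallest (most suppressed) tokens.
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    positive = ranked[::-1][:k]  # largest effect first
    negative = ranked[:k]        # most negative effect first
    return positive, negative


if __name__ == "__main__":
    # Toy 3-dimensional example with made-up (hypothetical) vectors.
    decoder = [1.0, 0.0, -1.0]
    vocab = {
        "tok_a": [0.5, 0.1, -0.2],
        "tok_b": [0.4, 0.0, -0.1],
        "tok_c": [-0.3, 0.2, 0.1],
        "tok_d": [0.0, 0.5, 0.2],
    }
    pos, neg = top_and_bottom(logit_effects(decoder, vocab), k=2)
    print("positive:", pos)
    print("negative:", neg)
```

    In this toy run tok_a (effect 0.7) and tok_b (0.5) head the positive list, and tok_c (-0.4) and tok_d (-0.2) head the negative list.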
    Activation Density: 0.004%
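    Assuming activation density means the percentage of dataset tokens on which the latent's activation is above zero (the page does not define it), 0.004% corresponds to roughly 4 tokens per 100,000, or about 1 in 25,000. A minimal sketch of that calculation follows; the function name and the zero threshold are assumptions for illustration.

```python
from typing import Iterable


def activation_density(acts: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    total = 0
    active = 0
    for a in acts:
        total += 1
        if a > threshold:
            active += 1
    # Guard against an empty input rather than dividing by zero.
    return 100.0 * active / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical data: 100,000 token activations, 4 of them nonzero.
    acts = [0.0] * 100_000
    for i in (10, 2_500, 40_000, 99_999):
        acts[i] = 1.3
    print(f"{activation_density(acts):.3f}%")  # prints 0.004%
```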

    No Known Activations