    Negative Logits
    "ك"             0.35
    " some"         0.29
    "the"           0.29
    "ка"            0.27
    "و"             0.27
                    0.27
    "ه"             0.26
    "一个"          0.26
    " verschied"    0.26
    "}}$,"          0.26
    Positive Logits
    " activism"         0.26
    " extremism"        0.23
    " évaluation"       0.23
    " enlightenment"    0.23
    " deception"        0.22
    " archery"          0.22
    " acidity"          0.21
    " artistry"         0.21
    " predation"        0.21
    "📫"                0.21
    Activation Density: 0.378%

    No Known Activations