    Negative Logits
    Token               Value
    " purely"           0.43
    " جزء"              0.40
    "irsty"             0.40
    " salah"            0.40
    "istu"              0.39
    " quasi"            0.39
    " affective"        0.39
    " ($\"              0.39
                        0.38
    " اگلے"             0.38
    Positive Logits
    Token               Value
    " unified"          0.88
    "unified"           0.79
    "Unified"           0.79
    " еди"              0.73
    "统一"              0.73
    " Unified"          0.70
    " unifying"         0.70
    " monolithic"       0.64
    "統一"              0.63
    " encompassing"     0.62
    Activation Density: 0.034%

    No Known Activations