INDEX
    Explanations

    simple and direct concepts

    Negative Logits
    Token                 Logit
    "al"                  1.74
    "ي"                   1.70
    "ar"                  1.56
    "ある"                1.47
    "و"                   1.40
                          1.37
    "ينا"                 1.36
    "Кроме"               1.34
    "ли"                  1.31
    " появились"          1.30

    Positive Logits
    Token                 Logit
    " pleasures"          2.18
    " joys"               1.71
    "minded"              1.69
    " dlatego"            1.58
    " elegance"           1.53
    "س"                   1.45
    " fact"               1.45
    " voilà"              1.42
    "weg"                 1.39
    " straightforward"    1.38
    Act Density 0.154%

    No Known Activations