INDEX
    Explanations

    explaining types of concepts

    Negative Logits
    " at"            0.68
    " elsewhere"     0.59
    " its"           0.59
    "اتها"           0.47
    " on"            0.44
    "Fourth"         0.44
    " e"             0.43
    (token missing)  0.42
    " product"       0.41
    " opp"           0.40
    Positive Logits
    " dentro"        1.11
    " Dentro"        1.09
    " binnen"        0.96
    " Within"        0.94
    "Within"         0.94
    " entrar"        0.92
    "𝙈"              0.90
    " inom"          0.90
    " داخل"          0.90
    "Dentro"         0.89
    Act Density 0.271%

    No Known Activations