    Explanations

    code and language models

    Negative Logits
    " લગ્ન"  -0.08
    " oval"  -0.08
    " painted"  -0.08
    " CWE"  -0.07
    " جمهوری"  -0.07
    " الزواج"  -0.07
    " sport"  -0.07
    " Republike"  -0.07
    " Painted"  -0.07
    -0.07
    Positive Logits
    " chatbot"  0.12
    " GPT"  0.12
    ".ll"  0.11
    "GPT"  0.11
    "_ll"  0.10
    "chunk"  0.10
    " llama"  0.10
    " ll"  0.10
    " chunk"  0.10
    " Chunk"  0.10
    Activation Density: 0.012%

    No Known Activations