INDEX
    Explanations

    large language model, trained by

    tokens that indicate the speaker’s identity as a large language model (words like "large", "language", "model" and related self‑identifying phrases/questions).

    Negative Logits
    "locks"          0.47
    " resembled"     0.47
    " resulted"      0.43
    "becomes"        0.43
    (whitespace)     0.43
    "τού"            0.41
    "ទ្រ"            0.41
    (whitespace)     0.41
    (whitespace)     0.41
    " becomes"       0.40
    Positive Logits
    " работаю"       0.67
    "uyorum"         0.64
    " знаю"          0.57
    " हूं"            0.57
    " jestem"        0.55
    "ıyorum"         0.54
    " atualmente"    0.54
    "这意味着"        0.54
    "我现在"          0.53
    " আছি"           0.51
    Act Density 0.133%

    No Known Activations