INDEX
    Explanations

    This neuron detects the model’s self-description phrase “As a large language model” (and similar self-referential disclaimers).

    Negative Logits
    " BK"  0.43
    " creciente"  0.43
    " Б"  0.40
    " приветствую"  0.39
    " B"  0.38
    " расту"  0.38
    " aumentada"  0.38
    " blight"  0.37
    " समावेश"  0.37
    "esp"  0.37
    Positive Logits
    " doesn"  0.45
    "ستانی"  0.41
    "doesn"  0.40
    (no visible token)  0.38
    (no visible token)  0.37
    "വുമായി"  0.37
    " item"  0.36
    " stanu"  0.36
    (no visible token)  0.36
    " eikä"  0.36
    Act Density 0.017%
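
    For orientation, a density figure like the one above can be read as the share of sampled token positions on which the neuron fires. The sketch below assumes that definition (activation strictly above a threshold, expressed as a percentage); the function name, the threshold, and the sample data are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    Assumption: density = 100 * (active positions) / (all sampled positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 100,000 token positions, 17 of which fire.
sample = [0.0] * 99_983 + [0.9] * 17
print(f"Act Density {activation_density(sample):.3f}%")  # Act Density 0.017%
```

    Under this reading, a density of 0.017% means the neuron is silent on nearly every token, which would be consistent with a sample that contains no recorded activations.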

    No Known Activations