INDEX
    Explanations

    large language model created by

    The neuron activates strongly when the model refers to itself as “a large language model,” that is, on self-identification phrases such as “As a large language model…”

    Negative Logits
     capables  0.48
     означа  0.41
     offrant  0.40
     capaces  0.40
     схема  0.39
     tribulations  0.39
     система  0.38
     வரவே  0.38
     없고  0.38
     hereby  0.38

    Positive Logits
     goes  0.42
     Rainbow  0.42
     (token missing in source)  0.42
     رفت  0.41
     itself  0.40
    我是  0.40
     моего  0.40
     অস্বাভাবিক  0.39
     Southwest  0.39
     cleaned  0.38
    Act Density 0.017%

    No Known Activations