INDEX
    Explanations

    The neuron is essentially “dead” in these examples: it never activates on any token, so it is not detecting any pattern.

    Negative Logits
    Token            Logit
    "(permission"    -0.07
    " Jon"           -0.07
    "Jon"            -0.07
    "Rocket"         -0.07
    " Hakk"          -0.07
    "Collector"      -0.07
    " hayır"         -0.07
    "333"            -0.07
    " produkt"       -0.07
    "ahir"           -0.06
    Positive Logits
    Token            Logit
    " Sarah"         0.08
    " Moodle"        0.07
    " sitting"       0.07
    " stew"          0.07
    " desk"          0.07
    " Cristina"      0.07
    " لل"            0.07
    "Sarah"          0.06
    " intoler"       0.06
    " stressful"     0.06
    Activation Density: 0.006%

    No Known Activations
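
The density figure reported above can be read as the share of tokens on which the neuron fires. The page does not say how it is computed, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a strictly
    positive activation. The source page does not define it.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 6 active tokens out of 100,000.
acts = [0.0] * 99_994 + [1.2] * 6
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.006%
```

Under this reading, a density this low is compatible with a sample of examples in which the neuron never fires at all.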