    Explanations

    This neuron fires on self-descriptive phrases in which the model says it operates “based on” its training data.

    Negative Logits
    Dealer          -0.06
      ↵↵↵           -0.06
    ать             -0.06
    "When           -0.06
     SHOW           -0.06
    Goals           -0.06
     Two            -0.06
     boots          -0.06
                    -0.06
    12              -0.06
    Positive Logits
     locksmith      0.07
    (nr             0.07
    div             0.07
    sov             0.07
    .ins            0.07
    <QString        0.06
     clf            0.06
     Sob            0.06
     LIB            0.06
    .customer       0.06
    Activation Density: 0.028%

    No Known Activations