INDEX
    Explanations

    The neuron responds to positive evaluative words (e.g. adjectives expressing approval or appeal).

    Negative Logits
    Token            Logit
    " happening"     -0.07
    "on"             -0.07
    " sending"       -0.07
    " helps"         -0.07
    "_col"           -0.07
    "Prob"           -0.07
    " totals"        -0.07
    " Work"          -0.07
    " hurt"          -0.06
    " dos"           -0.06
    Positive Logits
    Token               Logit
    " attractive"       0.12
    "tractive"          0.09
    " attractiveness"   0.09
    " brightly"         0.08
    " glamorous"        0.08
    (token missing)     0.07
    "_THEME"            0.07
    " entrenched"       0.07
    "леч"               0.07
    "رانه"              0.07
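
    The page does not state how the logit values above are derived. A common convention, assumed here rather than confirmed by the source, is to project a neuron's output direction through the model's unembedding matrix and list the tokens whose logits move most in each direction. The sketch below illustrates that idea on toy data. The names `logit_effects` and `top_logit_tokens`, the two-dimensional `direction`, and the tiny `unembed` matrix are all hypothetical and chosen only so the result echoes a few of the listed tokens.

```python
def logit_effects(direction, unembed):
    """Per-token logit change induced by a direction (assumed convention).

    direction: list of d_model floats (hypothetical neuron output vector)
    unembed:   d_model rows, each with n_vocab floats (hypothetical matrix)
    """
    n_vocab = len(unembed[0])
    return [
        sum(d * row[j] for d, row in zip(direction, unembed))
        for j in range(n_vocab)
    ]


def top_logit_tokens(direction, unembed, vocab, k=2):
    """Return the k most negative and k most positive (token, effect) pairs."""
    effects = logit_effects(direction, unembed)
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])  # ascending
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


# Toy data only: these numbers are illustrative, not taken from a real model.
vocab = [" attractive", " glamorous", " hurt", " helps"]
unembed = [
    [0.12, 0.08, -0.06, -0.07],
    [0.00, 0.00, 0.00, 0.00],
]
direction = [1.0, 0.0]

negative, positive = top_logit_tokens(direction, unembed, vocab, k=2)
for token, effect in positive:
    print(repr(token), round(effect, 2))
```

    Tokens are printed with `repr` so that a leading space, which distinguishes tokens such as " attractive" from "tractive", stays visible.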
    Activation Density: 0.009%

    No Known Activations