INDEX
    Explanations

    negation/reasoning

    Negative Logits
    " presets"      -0.07
    "aticon"        -0.07
    " Bentley"      -0.06
    " morally"      -0.06
    " Sind"         -0.06
    " sof"          -0.06
    " sensible"     -0.06
    " postponed"    -0.06
    "_modes"        -0.06
    ".includes"     -0.06
    Positive Logits
    " لت"           0.07
    " увагу"        0.06
    "大家"          0.06
    "strategy"      0.06
    " :)"           0.06
    " tier"         0.06
    " SUR"          0.06
    " outputFile"   0.06
    "쳤다"          0.06
    " Claude"       0.06
    Activation Density: 0.002%

    No Known Activations