    Explanations

    This neuron activates on words and phrases that signal explanation, reasoning, or justification (e.g., logic, decision, reason, analysis).

    Language expressing evaluation or judgment (opinionated or editorial statements).

    Language that critiques or questions decisions and actions, especially terms about logic, mistakes, errors, and controversial choices.

    Negative Logits
     Pool               -0.07
    sex                 -0.07
    recio               -0.07
    (blank)             -0.06
    composition         -0.06
     Responsibility     -0.06
    &eacute             -0.06
    .memory             -0.06
    shop                -0.06
     tubes              -0.06
    Positive Logits
     徒歩               0.07
     halted             0.07
     øns                0.07
     theoret            0.06
     kvin               0.06
    (blank)             0.06
    ,"                  0.06
     Inspector          0.06
     runoff             0.06
     CONSTANT           0.06
    Activation Density: 0.084%

    No Known Activations