INDEX
    Explanations

    The neuron detects terms referring to personal consent and limits—words like “boundaries,” “preferences,” and “autonomy.”

    Negative Logits
    Iso           -0.07
     console      -0.07
    _MALLOC       -0.06
    luluk         -0.06
     tours        -0.06
     fast         -0.06
    ($"{          -0.06
    science       -0.06
     convin       -0.06
     heating      -0.06
    Positive Logits
                  0.06
    імеч          0.06
    орд           0.06
     boundaries   0.06
     постро       0.06
     Accept       0.06
    =torch        0.06
     Bought       0.06
     방법         0.06
    upal          0.06
    Activation Density: 0.007%

    No Known Activations