INDEX
    Explanations

    It reliably activates on boilerplate instructional phrases (e.g. “In this article we will discuss…”).

    Negative Logits
     globalization
    -0.06
     magic
    -0.06
     대행
    -0.06
    iet
    -0.06
    rust
    -0.06
    erokee
    -0.06
     раст
    -0.06
     futile
    -0.06
    IDX
    -0.06
    reation
    -0.06
    Positive Logits
     juin
    0.07
     <:
    0.07
    ...',
    0.06
    '],
    0.06
    					      
    0.06
     antibody
    0.06
     skilled
    0.06
     +'
    0.06
    _progress
    0.06
     Honor
    0.06
    Activation Density 0.032%

    No Known Activations