INDEX
    Explanations

    references to family and social connections

    Negative Logits
    /Instruction
    -0.15
    otyping
    -0.14
     Manson
    -0.14
    ipar
    -0.14
     Cly
    -0.14
    840
    -0.14
    /mit
    -0.14
    ophobic
    -0.14
    ameron
    -0.13
    itch
    -0.13
    Positive Logits
    rief
    0.17
     CONS
    0.16
    eyen
    0.15
    лем
    0.14
     pře
    0.14
    xcf
    0.14
    ainen
    0.14
    obox
    0.14
    uka
    0.14
     toes
    0.13
    Act Density 0.126%

    No Known Activations