    Negative Logits
    Token           Logit
    "+n"            -0.07
    "RI"            -0.06
    "ิร"            -0.06
    " Pref"         -0.06
    "+p"            -0.06
    "ρων"           -0.06
    " Haupt"        -0.06
    ".herokuapp"    -0.06
                    -0.06
    "ingleton"      -0.06
    Positive Logits
    Token           Logit
    "...,"          0.07
    " Interior"     0.07
    " result"       0.07
    " morally"      0.07
    "Friendly"      0.07
    "(Stack"        0.07
    " nga"          0.07
    "loha"          0.06
    " -->↵↵↵"       0.06
    " trusted"      0.06
    Activation Density: 0.071%

    No Known Activations