INDEX
    Explanations

    sections of text labeled as "Categories."

    Negative Logits
    aid          -0.16
     Rubin       -0.14
     Jung        -0.14
    ollen        -0.14
    sie          -0.14
    moth         -0.14
     Trev        -0.14
    odom         -0.13
     Shops       -0.13
    ette         -0.13
    Positive Logits
    rong         0.17
    åĽ           0.16
     GOODMAN     0.16
    .foundation  0.16
    má           0.15
    apeut        0.15
    gien         0.14
    deme         0.14
    ynam         0.13
    ooled        0.13
    Act Density 0.004%

    No Known Activations