INDEX
    Explanations

    First-person pronouns

    Negative Logits
     fontsize   -0.07
    -",         -0.06
    .yaml       -0.06
     swear      -0.06
                -0.06
     runner     -0.06
    rc          -0.06
    –and        -0.06
     mars       -0.06
     WRONG      -0.06
    Positive Logits
     "          0.06
     indian     0.06
     '">'       0.06
     suicidal   0.06
     hero       0.05
                0.05
     MAN        0.05
     земель     0.05
    asset       0.05
                0.05
    Activation Density: 0.189%

    No Known Activations