INDEX
    Explanations

    pronouns related to individuals or groups

    Negative Logits
    Token       Logit
     itself     -0.17
    e           -0.17
    taire       -0.15
    onaut       -0.15
    ayne        -0.14
    ï           -0.14
    pom         -0.14
    purple      -0.14
    ibel        -0.14
    isma        -0.14
    Positive Logits
    Token       Logit
    /us          0.38
    /her         0.35
    self         0.26
    zelf         0.25
    SELF         0.25
    /th          0.23
    iner         0.21
    atically     0.20
    -même        0.20
    chy          0.19
    Activation Density: 0.155%

    No Known Activations