    Explanations

    references to collective identity or community

    Negative Logits
    Token             Logit
    " itself"         -0.21
    " themselves"     -0.21
    "e"               -0.17
    "oad"             -0.16
    "er"              -0.15
    "lectron"         -0.15
    "ton"             -0.15
    "inia"            -0.15
    "(s"              -0.15
    "noon"            -0.14
    Positive Logits
    Token             Logit
    "/us"             0.39
    "/me"             0.31
    "/her"            0.28
    "/th"             0.27
    "enet"            0.27
    "ury"             0.26
    "urious"          0.23
    " ourselves"      0.23
    "self"            0.22
    "usal"            0.22
    Activation Density: 0.068%

    No Known Activations