INDEX
    Explanations

    references to the color white

    Negative Logits
    "nt"       -0.23
    "ments"    -0.20
    "nd"       -0.20
    "nya"      -0.18
    "ment"     -0.17
    "rest"     -0.17
    "ly"       -0.16
    "name"     -0.16
    "AMES"     -0.16
    "ively"    -0.15
    Positive Logits
    "-collar"         0.28
    "hall"            0.26
    "bread"           0.24
    "caps"            0.24
    " supremacist"    0.24
    "-hot"            0.23
    "papers"          0.23
    "-trash"          0.22
    "legg"            0.22
    "board"           0.22
    Activation Density: 0.033%

    No Known Activations