INDEX
    Explanations

    references to the color white in various contexts

    Negative Logits
    Token            Logit
    "nt"             -0.20
    "nd"             -0.20
    " purple"        -0.18
    "name"           -0.17
    "nya"            -0.17
    "nts"            -0.17
    "ments"          -0.16
    "rian"           -0.16
    "epad"           -0.16
    "mand"           -0.15

    Positive Logits
    Token            Logit
    " supremacist"   0.22
    "-white"         0.22
    "-collar"        0.21
    "legg"           0.20
    "WHITE"          0.20
    "White"          0.19
    "chalk"          0.19
    "caps"           0.19
    "Noise"          0.19
    " White"         0.19
    Activation Density: 0.041%

    No Known Activations