INDEX
    Explanations

    mentions of the word "White" in various contexts

    Negative Logits
        Token             Logit
        "nd"              -0.20
        "nt"              -0.17
        " purple"         -0.16
        "rian"            -0.16
        "epad"            -0.15
        "ments"           -0.15
        "istic"           -0.15
        "nya"             -0.14
        "roz"             -0.14
        "scope"           -0.14

    Positive Logits
        Token             Logit
        " supremacist"     0.21
        "-collar"          0.21
        "-white"           0.20
        "bread"            0.20
        "aker"             0.20
        "WHITE"            0.19
        "-trash"           0.19
        "White"            0.19
        "paper"            0.19
        "papers"           0.19

    Activation Density: 0.043%

    No Known Activations