INDEX
    Explanations

    references to historical narratives and societal perceptions related to race and privilege

    Negative Logits
    "rungsseite"       -1.11
    " Monfieur"        -1.02
    " étoient"         -1.02
    " myſelf"          -0.98
    " مشين"            -0.98
    " propOrder"       -0.97
    " avoient"         -0.95
    " wikipagina"      -0.94
    " ainfi"           -0.94
    " bezeichneter"    -0.92
    Positive Logits
    (token not shown)  0.76
    (whitespace)       0.74
    (token not shown)  0.72
    "↵↵"               0.67
    ","                0.67
    "<eos>"            0.66
    "."                0.65
    " '"               0.64
    " O"               0.64
    " a"               0.63
    Act Density 0.439%

    No Known Activations