INDEX
    Explanations

    thematic elements related to cultural or racial identity

    Negative Logits
         [              -0.20
        (not rendered)  -0.19
        ...\            -0.19
        (whitespace)    -0.18
         ...\           -0.17
        ...");↵↵        -0.17
        ,...            -0.16
        「……            -0.16
         “…             -0.16
        ...</           -0.15
    Positive Logits
        (.              0.27
         .              0.26
         .↵             0.25
        ".↵             0.25
         (.             0.24
         ".↵            0.22
        ".              0.22
         ".             0.21
         -.             0.20
         ().            0.19
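    The dashboard does not say how these token lists are produced. A common convention for sparse-autoencoder feature dashboards, assumed here rather than confirmed by this page, is to multiply the feature's decoder direction by the model's unembedding matrix and report the k most negative and k most positive entries. A minimal pure-Python sketch of that assumed procedure follows. The names `logit_effects`, `top_logits`, `decoder_vec` and `W_U` are hypothetical, and the toy inputs are illustrative only.

```python
from typing import List, Sequence, Tuple

Ranked = List[Tuple[str, float]]


def logit_effects(decoder_vec: Sequence[float],
                  W_U: Sequence[Sequence[float]]) -> List[float]:
    """Project one feature's decoder direction (length d_model) through a
    hypothetical unembedding matrix W_U of shape (d_model, vocab_size)."""
    d_model = len(decoder_vec)
    vocab_size = len(W_U[0])
    # Dot product of the decoder vector with each column of W_U (one per token).
    return [sum(decoder_vec[i] * W_U[i][t] for i in range(d_model))
            for t in range(vocab_size)]


def top_logits(effects: Sequence[float], vocab: Sequence[str],
               k: int = 10) -> Tuple[Ranked, Ranked]:
    """Return (positive, negative) lists of (token, effect), each of length k,
    with the largest-magnitude entries first."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]         # most negative effect first
    positive = ranked[::-1][:k]   # most positive effect first
    return positive, negative


if __name__ == "__main__":
    # Toy example: d_model = 2, vocabulary of four made-up tokens.
    effects = logit_effects([1.0, -0.5],
                            [[0.2, 0.1, -0.1, 0.0],
                             [-0.1, 0.0, 0.2, 0.3]])
    positive, negative = top_logits(effects, ["a", "b", "c", "d"], k=2)
    print(positive)  # tokens "a" then "b"
    print(negative)  # tokens "c" then "d"
```

    Under this assumption, the sign only says whether the feature's direct effect pushes a token's logit up or down. It says nothing about how often the feature fires.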
    Activation Density: 0.001%
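    If activation density is read as the share of sampled token positions on which the feature's activation is above zero, then 0.001% means roughly one firing per 100,000 tokens (100 × 1 / 100,000 = 0.001). This page states neither the threshold nor the token sample used, so both are assumptions here. The hypothetical helper `activation_density` below sketches that reading:

```python
from typing import Iterable


def activation_density(activations: Iterable[float],
                       threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`.
    The zero threshold is an assumed default, not taken from the dashboard."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


if __name__ == "__main__":
    # One nonzero activation among 100,000 positions.
    acts = [0.0] * 99_999 + [3.2]
    print(activation_density(acts))
```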

    No Known Activations