    Explanations

    References to historical and societal critiques, particularly of oppression or unethical practices.

    Negative Logits
    Token        Logit
    "Bias"       -0.16
    " intrig"    -0.16
    "bias"       -0.15
    "/welcome"   -0.15
    "Spo"        -0.15
    " Bias"      -0.15
    "ussen"      -0.14
    "istani"     -0.14
    " biased"    -0.14
    "่า"         -0.14
    Positive Logits
    Token        Logit
    " demon"     0.25
    " romantic"  0.25
    " lion"      0.24
    " commod"    0.23
    " glam"      0.23
    "妖"         0.23
    " trivial"   0.22
    " valor"     0.22
    " normal"    0.22
    " NORMAL"    0.22
    Activation Density: 0.172%
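
    The density figure can be read as the share of sampled tokens on which the feature has a nonzero activation. That reading is an assumption here, since the page does not define the metric. A minimal sketch under that assumption follows; the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 172 of which activate the feature.
sample = [0.0] * 99_828 + [1.5] * 172
density = activation_density(sample)
print(f"{density:.3%}")
```

    A density this low would mean the feature fires on roughly one token in every 580, which fits a feature tied to a narrow topic.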

    No Known Activations