INDEX
    Explanations

    statements highlighting societal biases and inconsistencies

    Negative Logits
    "lí"          -0.15
    "iola"        -0.15
    "ンズ"        -0.14
    "_BUF"        -0.14
    "ング"        -0.14
    "oretical"    -0.14
    "yonel"       -0.14
    "arak"        -0.14
    "erable"      -0.13
    "rams"        -0.13
    Positive Logits
    " even"       0.29
    "even"        0.25
    " almost"     0.24
    " даже"       0.22
    "almost"      0.21
    " sogar"      0.21
    " EVEN"       0.20
    " Even"       0.19
    "Even"        0.19
    " it"         0.19
    Act Density 0.101%

    No Known Activations