    Explanations

    indicative statements criticizing societal structures or norms

    Negative Logits
     wing
    -0.16
    iddy
    -0.15
    labs
    -0.15
    ose
    -0.14
    رات
    -0.14
    ếp
    -0.14
    ign
    -0.14
    monton
    -0.14
     stru
    -0.14
    lak
    -0.14
    Positive Logits
     instead
    0.69
    instead
    0.63
     Instead
    0.61
    Instead
    0.59
     вмест
    0.47
     Nope
    0.32
     inve
    0.28
     místo
    0.27
    แทน
    0.25
     sondern
    0.23
    Act Density 0.207%

    No Known Activations