INDEX
    Explanations

    phrases indicating concern for vulnerable groups in society

    Negative Logits
    Token       Logit
    " deser"    -0.15
    "nuts"      -0.14
    "ddd"       -0.14
    "rang"      -0.14
    "(mask"     -0.13
    "ovies"     -0.13
    "wit"       -0.13
    "MASK"      -0.13
    " Mask"     -0.13
    "nut"       -0.13
    Positive Logits
    Token       Logit
    " us"       0.18
    "usat"      0.16
    "chr"       0.15
    "ayscale"   0.15
    "orne"      0.13
    " Sabb"     0.13
    " Juda"     0.13
    "ograd"     0.13
    "ght"       0.13
    "è¾"        0.13
    Activation Density: 0.150%

    No Known Activations