    Explanations

    comparative phrases that assess superiority or inferiority

    Negative Logits
     Prev
    -0.15
    iden
    -0.15
    ker
    -0.15
    混合
    -0.14
    _PICTURE
    -0.14
     Rubio
    -0.14
    illus
    -0.14
    urga
    -0.14
    stagram
    -0.14
    不了
    -0.14
    Positive Logits
     original
    0.22
     originals
    0.20
    original
    0.19
     direct
    0.19
    ORIGINAL
    0.18
    原始
    0.18
    direct
    0.18
     ориг
    0.17
    直接
    0.17
    straight
    0.17
    Act Density 0.005%

    No Known Activations