INDEX
    Explanations

    phrases that express direct address or inclusivity toward the audience

    NEGATIVE LOGITS
    utor            -0.15
    人人            -0.14
    rts             -0.14
    _SPE            -0.14
     secondary      -0.13
    ä¼ı             -0.13
     Surveillance   -0.13
     $(             -0.13
    oci             -0.13
     stride         -0.13
    POSITIVE LOGITS
     readers        0.31
     reader         0.26
     Readers        0.25
     unfamiliar     0.24
    reader          0.22
     Reader         0.22
    Reader          0.21
     reading        0.20
    -reader         0.20
     wondering      0.19
    Activation Density 0.071%

    No Known Activations