    Explanations

    phrases indicating speech or expression of opinions

    Negative Logits
     ('
    -0.19
     (“
    -0.18
     ("
    -0.18
     («
    -0.16
     (`
    -0.16
     которым
    -0.14
    owie
    -0.14
    .sz
    -0.14
    è½
    -0.14
    _DX
    -0.13
    Positive Logits
    ,↵
    0.29
    ,
    0.25
    ,"
    0.23
    ,↵↵
    0.23
    ,”
    0.23
    :
    0.21
    ,"↵
    0.21
    ,č↵
    0.20
     ,↵
    0.18
    ,↵↵↵↵
    0.18
    Activation Density 0.207%

    No Known Activations