INDEX
    Explanations

    instances of the word "don't"

    Negative Logits
     “    -0.93
    =”    -0.87
    =’    -0.80
    .”    -0.80
    ,”    -0.80
    ”,    -0.76
     (“    -0.75
    …”    -0.74
    ”),    -0.73
    ?”    -0.72
    Positive Logits
    '    1.68
     '    1.43
    "    1.39
    。"    1.37
    '"    1.28
     "    1.28
    <bos>    1.24
    '.    1.23
    "'    1.20
    '...    1.17
    Act Density 0.655%

    No Known Activations