INDEX
    Explanations

    colons followed by sentences

    punctuation or formatting indicators in the text

    Negative Logits
        " reconc"       -0.74
        " tremend"      -0.73
        "æ©"            -0.72
        " ingred"       -0.70
        " territ"       -0.68
        " bean"         -0.66
        "¥ŀ"            -0.65
        " adversaries"  -0.65
        " diseng"       -0.65
        " manif"        -0.64
    Positive Logits
        " âĨij"         1.35
        " Originally"   1.10
        "Originally"    0.91
        "Show"          0.88
        " Interesting"  0.88
        " Regarding"    0.85
        " Yeah"         0.84
        " Assuming"     0.84
        " Whilst"       0.82
        " >>"           0.81
    Act Density 0.059%
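Act Density indicates how often the feature fires. This sketch assumes it means the fraction of token positions where the activation is above zero, which the page does not define. On that reading, 0.059% is about 59 firing tokens per 100,000. The function name `activation_density` and the sample data are hypothetical.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical sample: 100,000 tokens, of which the feature fires on 59.
sample = [0.0] * 99_941 + [1.0] * 59
print(f"{activation_density(sample):.3%}")  # prints 0.059%
```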

    No Known Activations