INDEX
    Explanations

    the presence of special characters or formatting elements in the text

    Negative Logits
    " Diſ"           -0.85
    " themſelves"    -0.84
    " Conſ"          -0.83
    " itſelf"        -0.80
    " Inſ"           -0.79
    " raiſ"          -0.79
    " myſelf"        -0.78
    " himſelf"       -0.78
    " juſ"           -0.78
    " ſta"           -0.78

    Positive Logits
    " aDecoder"      0.48
    " d"             0.47
    "MessageOf"      0.46
    "en"             0.45
    " Griswold"      0.44
    "mphony"         0.44
    "czę"            0.43
    "Dragon"         0.42
    " Hentet"        0.42
    " ist"           0.42

    Activation Density: 0.001%

    No Known Activations