INDEX
    Explanations

    the letters 't' with a high activation value

    negations and the word "not."

    Negative Logits
    Token          Logit
    " Reloaded"    -0.72
    " Passenger"   -0.65
    "ħĭ"           -0.65
    " descent"     -0.64
    " Penguin"     -0.61
    " Pike"        -0.60
    " Seah"        -0.59
    " behavi"      -0.59
    " Palestin"    -0.59
    "çĦ"           -0.59
    Positive Logits
    Token          Logit
    "ween"         1.01
    "reprene"      0.93
    "unes"         0.92
    "une"          0.91
    "urb"          0.82
    "urtles"       0.82
    "weet"         0.82
    "aper"         0.81
    "ruly"         0.80
    "UNE"          0.78
    Activation Density: 0.113%

    No Known Activations