INDEX
    Explanations

    determiner phrases, particularly ones starting with "the" or "a"

    Negative Logits
     itſelf
    -1.17
     Theſe
    -1.17
     ་་
    -1.16
     Jefus
    -1.13
     ―――――
    -1.12
     iſt
    -1.11
     myſelf
    -1.11
     ſind
    -1.08
     محفوظة
    -1.07
    )";↵
    -1.06
    Positive Logits
     on
    1.34
     On
    0.86
    0.86
    On
    0.81
     ON
    0.75
     to
    0.74
     in
    0.74
    ↵↵
    0.71
    .
    0.71
     at
    0.71
    Act Density 0.197%

    No Known Activations