INDEX
    Explanations

    phrases indicating negation or absence

    Negative Logits
     fometimes
    -0.80
     Anſ
    -0.79
     ſind
    -0.74
     ſeveral
    -0.74
     myſelf
    -0.74
     chofe
    -0.72
     Catto
    -0.72
     fhort
    -0.71
     itſelf
    -0.71
     iſt
    -0.69
    Positive Logits
     non
    0.88
     Non
    0.83
    Non
    0.81
     without
    0.77
     ilman
    0.77
    0.75
    Without
    0.73
     ohne
    0.72
    AndEndTag
    0.72
    Без
    0.72
    Act Density 0.539%

    No Known Activations