INDEX
    Explanations

    phrases that express negation or limitation

    Negative Logits
    Token          Logit
    " itſelf"      -1.12
    " raiſ"        -0.96
    " himſelf"     -0.95
    " myſelf"      -0.90
    "ſelf"         -0.88
    " ſche"        -0.86
    " pleaſure"    -0.86
    " ſch"         -0.86
    " cauſe"       -0.85
    " ſta"         -0.84
    Positive Logits
    Token          Logit
    " schon"       1.05
    " noch"        1.00
    " nog"         0.86
    " nicht"       0.85
    " auch"        0.74
    "nicht"        0.72
    " niet"        0.71
    " nur"         0.69
    " immer"       0.68
    "Noch"         0.68
    Act Density 0.139%

    No Known Activations