    Negative Logits
    Token               Logit
    " raiſ"             -0.69
    " ſeveral"          -0.69
    " myſelf"           -0.69
    " whoſe"            -0.68
    " themſelves"       -0.68
    " fubject"          -0.64
    "SError"            -0.63
    "Personendaten"     -0.63
    "rungsseite"        -0.63
    " uſ"               -0.62
    Positive Logits
    Token               Logit
    " shock"            0.73
    "InjectAttribute"   0.68
    " kann"             0.66
    " wird"             0.65
    " shocks"           0.63
    "shock"             0.61
    " SHOCK"            0.58
    " konnte"           0.58
    " Dong"             0.58
    "Shock"             0.56
    Act Density 0.069%

    No Known Activations