INDEX
    Explanations

    terms related to adversarial entities or threats

    Negative Logits
    Token           Logit
    " erſt"         -0.73
    " dieſer"       -0.72
    " verſch"       -0.72
    " müſſen"       -0.71
    " wiſſen"       -0.70
    " unſer"        -0.70
    " ſeinem"       -0.69
    " ſelbſt"       -0.69
    " ſans"         -0.69
    " ſeinen"       -0.69
    Positive Logits
    Token           Logit
    " enemy"         1.33
    " opponent"      1.21
    " opponents"     1.20
    "enemy"          1.08
    " enemies"       1.07
    " Enemy"         1.07
    " musuh"         1.05
    " adversaries"   1.02
    "Enemy"          1.00
    " enemigo"       0.98
    Activation Density: 0.277%

    No Known Activations