INDEX
    Explanations

    expressions of frustration or disbelief

    phrases that express confusion or outrage

    Negative Logits
    Token           Logit
    "ierrez"        -0.74
    "onial"         -0.64
    "orea"          -0.61
    "ridor"         -0.61
    "izont"         -0.60
    "itures"        -0.60
    "ateral"        -0.60
    "utor"          -0.59
    "selves"        -0.58
    "ettlement"     -0.57
    Positive Logits
    Token           Logit
    " hell"         1.59
    " heck"         1.48
    " fuck"         1.44
    " HELL"         1.33
    " Fuck"         1.13
    " FUCK"         1.10
    " Hell"         1.05
    " heavens"      1.04
    "fuck"          0.99
    " gods"         0.98
    Activation Density: 0.097%

    No Known Activations