INDEX
    Explanations

    affirmations and expressions of agreement

    Negative Logits
    "ocu"            -0.16
    " dök"           -0.15
    " Hey"           -0.14
    "acht"           -0.14
    "alian"          -0.14
    " PERMISSION"    -0.13
    " Uns"           -0.13
    "fet"            -0.13
    "atra"           -0.13
    "mitt"           -0.13
    Positive Logits
    " wrong"         0.75
    " correct"       0.75
    "wrong"          0.64
    " Wrong"         0.60
    " WRONG"         0.57
    "Wrong"          0.57
    "correct"        0.57
    " incorrect"     0.56
    " Correct"       0.53
    "Correct"        0.52
    Activation Density 0.226%

    No Known Activations