INDEX
    Explanations

    phrases related to false information or deceit

    Negative Logits
    Token                    Logit
    "xual"                   -0.95
    "hem"                    -0.75
    "night"                  -0.73
    "served"                 -0.73
    "riott"                  -0.71
    "onential"               -0.69
    "onen"                   -0.66
    "pai"                    -0.65
    " guiActiveUnfocused"    -0.64
    "interrupted"            -0.64

    Positive Logits
    Token                    Logit
    "ument"                   1.02
    " news"                   0.94
    " IDs"                    0.87
    " pas"                    0.87
    " positives"              0.81
    "ulent"                   0.79
    " NEWS"                   0.73
    " identities"             0.70
    "ulence"                  0.70
    "outs"                    0.67

    Activation Density: 0.055%

    No Known Activations