INDEX
    Explanations

    references to information that is considered false or deceptive

    references to "fake news."

    Negative Logits
    Token          Logit
    "xual"         -0.81
    "hens"         -0.73
    "arching"      -0.73
    "APTER"        -0.71
    "azar"         -0.71
    "}}}"          -0.70
    " Discuss"     -0.68
    "ands"         -0.68
    "ires"         -0.67
    "Reviewed"     -0.66
    Positive Logits
    Token          Logit
    " fake"         0.86
    "²¾"            0.83
    " pas"          0.83
    " phony"        0.72
    " bait"         0.71
    "ument"         0.70
    " Fake"         0.70
    " reef"         0.70
    "ulously"       0.67
    "eln"           0.67
    Activation Density: 0.016%

    No Known Activations