    Explanations

    references to deception or falsehoods, particularly concerning "fake" news and related concepts

    Negative Logits
    " '\\;'"              -0.72
    " ſhall"              -0.71
    "KommentareTeilen"    -0.70
    " muſt"               -0.68
    "findpost"            -0.63
    " anſ"                -0.63
    " pouvoit"            -0.62
    "ölkerung"            -0.61
    "اریخ"                -0.60
    ">--}}"               -0.60
    Positive Logits
    " fake"           1.25
    " faking"         1.18
    " faked"          1.13
    " fakes"          1.10
    "fake"            1.07
    "Fake"            1.06
    " Fake"           1.03
    " mock"           1.01
    " phony"          1.00
    " pretending"     0.96
    Activation Density: 2.747%

    No Known Activations