    Explanations

    disinformation misinformation fake news

    NEGATIVE LOGITS
    " 촬영"              0.44
    "Pain"               0.42
    " introd"            0.41
    (token not shown)    0.41
    "Stim"               0.41
    " ছুই"               0.41
    " adventurer"        0.41
    "algèbre"            0.39
    "queleto"            0.39
    "🌃"                 0.39
    POSITIVE LOGITS
    " disinformation"    1.78
    " misinformation"    1.72
    " fake"              1.34
    " propaganda"        1.34
    " Fake"              1.25
    "Fake"               1.22
    " falsehood"         1.20
    "fake"               1.14
    " propagand"         1.13
    " Propaganda"        1.10
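    On sparse-autoencoder feature dashboards, positive and negative logit lists like these are commonly obtained by projecting the feature's decoder direction onto each token's unembedding vector and keeping the highest and lowest scores. The sketch below assumes that method; the function names, the two-dimensional decoder direction, and the four-token vocabulary are hypothetical toy values, not this feature's actual weights.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]

def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    # Assumed method: score each token by the dot product of the feature's
    # decoder direction with that token's unembedding vector.
    return {
        tok: sum(d * w for d, w in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }

def top_logit_effects(
    decoder_dir: List[float], unembed: Dict[str, List[float]], k: int
) -> Tuple[Ranked, Ranked]:
    scores = logit_effects(decoder_dir, unembed)
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # ascending by score
    positive = ranked[::-1][:k]  # most promoted tokens, largest score first
    negative = ranked[:k]        # most suppressed tokens, most negative first
    return positive, negative

# Hypothetical toy weights (hidden size 2, four-token vocabulary).
decoder_dir = [1.0, 0.5]
unembed = {
    " fake": [1.0, 0.4],
    " propaganda": [0.8, 0.6],
    " adventurer": [-0.3, -0.2],
    "Pain": [-0.4, 0.1],
}

pos, neg = top_logit_effects(decoder_dir, unembed, k=2)
print("POSITIVE", [(t, round(s, 2)) for t, s in pos])
print("NEGATIVE", [(t, round(s, 2)) for t, s in neg])
```

    With real weights the vectors would have the model's hidden size and the vocabulary its full token set; only the ranking step is the point here.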
    Activation Density 0.024%
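    Activation density is usually defined as the share of tokens on which the feature fires, so 0.024% corresponds to about 24 tokens in every 100,000. A minimal sketch, assuming a flat list of per-token activations and a firing threshold of zero (both assumptions, as is the helper name `activation_density`):

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one token")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)

# Toy example: 24 nonzero activations spread over 100,000 tokens.
acts = [0.0] * 100_000
for i in range(0, 24 * 1000, 1000):
    acts[i] = 1.5

print(f"{activation_density(acts):.3f}%")  # prints 0.024%
```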

    No Known Activations