    Explanations

    words related to deception or false information

    mentions of hoaxes and pranks

    Negative Logits
    bourg       -0.68
     Borders    -0.68
    aws         -0.66
    ailable     -0.65
    uv          -0.65
    uner        -0.62
    udeau       -0.62
    bilt        -0.62
    oyal        -0.61
    asper       -0.60
    Positive Logits
     hoax       1.03
    sters       0.87
    ²¾          0.84
    erella      0.81
    ually       0.80
    ishly       0.79
    edly        0.79
    es          0.78
    ulence      0.75
    ed          0.75
    Activation Density: 0.029%

    No Known Activations