    Explanations

    names and dates in social media posts

    references to Twitter handles or user mentions

    Negative Logits
    Token            Logit
    " harms"         -0.75
    "houses"         -0.72
    "itory"          -0.71
    " attractions"   -0.68
    " pockets"       -0.68
    " clothes"       -0.68
    " criminals"     -0.66
    "fruit"          -0.64
    " medicines"     -0.63
    "Ĥª"             -0.63

    Positive Logits
    Token            Logit
    "TPS"             1.00
    "76561"           0.83
    "Twe"             0.78
    ">]"              0.71
    " Ùħ"             0.70
    " VIDEOS"         0.69
    "Official"        0.69
    "VERTIS"          0.68
    "Patch"           0.67
    " Originally"     0.66
    Act Density 0.043%

    No Known Activations