INDEX
    Explanations

    phrases related to controversy or conflict, particularly around online harassment

    Negative Logits
        " hairc"     -1.37
        " fuf"       -1.36
        " scrat"     -1.30
        " increa"    -1.28
        " sappi"     -1.27
        " guarante"  -1.27
        " chrysler"  -1.25
        " emphat"    -1.24
        " unve"      -1.24
        " maneu"     -1.23
    Positive Logits
        " Instead"       0.85
        " They"          0.84
        " Specifically"  0.75
        " ***!"          0.74
        " Firstly"       0.71
        "They"           0.70
        " He"            0.70
        " After"         0.69
        "])):"           0.69
        "Instead"        0.69
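    Dashboards of this kind commonly derive the two lists above by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most negative and most positive entries. The sketch below assumes that method. The function name `logit_effects`, the toy vectors, and the four-token vocabulary are hypothetical and do not reproduce the values above.

```python
from typing import List, Tuple


def logit_effects(
    decoder_dir: List[float],
    unembed: List[List[float]],
    vocab: List[str],
    k: int,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (bottom_k, top_k) tokens ranked by direct logit effect.

    Assumption: the effect on token j is the dot product of the feature's
    decoder direction with column j of the unembedding matrix, which is
    stored here as d_model rows by len(vocab) columns.
    """
    d_model = len(decoder_dir)
    if len(unembed) != d_model or any(len(row) != len(vocab) for row in unembed):
        raise ValueError("unembed must be d_model x len(vocab)")
    # effect[j] = sum over i of decoder_dir[i] * unembed[i][j]
    effects = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    bottom = ranked[:k]        # most negative: tokens the feature suppresses
    top = ranked[::-1][:k]     # most positive: tokens the feature promotes
    return bottom, top
```

    For example, a decoder direction of `[1.0, -0.5]` against a 2 x 4 toy unembedding gives each token one scalar effect, and the sort splits them into a "Negative Logits" list and a "Positive Logits" list. This mirrors the layout of the dashboard.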
    Activation Density: 0.596%
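    Activation density is usually reported as the percentage of sampled tokens on which the feature's activation is strictly positive. Under that definition, 0.596% corresponds to roughly 6 firing tokens per 1,000. The sketch below assumes that definition and a flat list of per-token activations. The function name `activation_density` is hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float]) -> float:
    """Percentage of tokens whose activation is > 0 (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    # Count tokens on which the feature fires at all.
    firing = sum(1 for a in activations if a > 0.0)
    return 100.0 * firing / len(activations)
```

    A feature that fires on 3 out of 500 tokens would therefore have a density of 0.6%.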

    No Known Activations