INDEX
    Explanations

    data analysis, measurement

    counter-narratives to hate speech examples.

    Negative Logits
    Token        Logit
    "_math"      -0.07
    "Box"        -0.06
    "Baseline"   -0.06
    ""           -0.06
    "\$"         -0.06
    "./"         -0.06
    ".bulk"      -0.06
    "-disc"      -0.06
    "(com"       -0.06
    ".Yes"       -0.06
    Positive Logits
    Token               Logit
    "ियल"               0.07
    " picturesque"      0.07
    "189"               0.07
    "odafone"           0.07
    "169"               0.06
    " businesses"       0.06
    "476"               0.06
    " preach"           0.06
    " cardiovascular"   0.06
    " technology"       0.06
    Activation Density: 0.005%

    No Known Activations