Explanations
data analysis, measurement
counter-narratives to hate speech examples.
Negative Logits
_math       -0.07
Box         -0.06
Baseline    -0.06
            -0.06
\$          -0.06
./          -0.06
.bulk       -0.06
-disc       -0.06
(com        -0.06
.Yes        -0.06
Positive Logits
ियल               0.07
picturesque      0.07
189              0.07
odafone          0.07
169              0.06
businesses       0.06
476              0.06
preach           0.06
cardiovascular   0.06
technology       0.06
Activations Density 0.005%