INDEX
Explanations
harmful activities and content
Negative Logits
os      0.41
on      0.40
ز       0.38
েল      0.37
с       0.37
مان     0.36
ів      0.35
ной     0.35
ના      0.35
з       0.35
Positive Logits
ר               0.36
victimization   0.36
assemblages     0.35
attacks         0.33
raids           0.32
repris          0.32
акции           0.31
advis           0.30
screenings      0.30
harassment      0.29
Activations Density: 0.554%