INDEX
Explanations
illegal or harmful content
Words describing prohibited content types and policy violations on online platforms.
Negative Logits
vinci          -1.24
marta          -1.20
satel          -1.16
はこんな感じ   -1.13
wanda          -1.13
philippe       -1.13
を知る         -1.12
dorado         -1.10
paulo          -1.09
marmor         -1.09
Positive Logits
or               1.46
versátil         1.45
Bardzo           1.43
içeri            1.41
görüntüsü        1.30
delitos          1.27
Сергей           1.16
либо             1.15
content          1.15
extremadamente   1.14
Activations Density: 0.036%