INDEX
Explanations
abusive or offensive language
Negative Logits
エ  0.39
මන්  0.38
مش  0.38
جز  0.38
ಂಜ  0.38
الم  0.37
ponemos  0.37
います  0.37
Essas  0.36
ሁሉም  0.36
Positive Logits
argues  0.45
itrile  0.43
RowFilter  0.42
argued  0.42
vée  0.42
argument  0.41
rot  0.40
ករណ៍  0.40
grandchild  0.40
ell  0.40
Activations Density 0.011%