INDEX
Explanations
hate and offensive language
Negative Logits
toFixed       0.40
ischem        0.38
সানডে         0.38
Arqu          0.37
حلقه          0.37
Penc          0.37
ाइस           0.36
Eis           0.36
quelconque    0.36
emis          0.36
Positive Logits
notes         0.54
track         0.42
realize       0.42
notes         0.42
offend        0.42
cheat         0.41
ノート        0.41
note          0.40
最小          0.40
access        0.40
Activations Density 0.000%