Explanations
phrases indicating risks and threats to health or safety
Negative Logits
Token     Logit
Tale      -0.17
ero       -0.15
èħ        -0.15
mani      -0.15
Inline    -0.14
ani       -0.14
ona       -0.14
onas      -0.14
QRSTUV    -0.14
imm       -0.14
Positive Logits
Token     Logit
unden     0.15
رÙĬÙģ     0.14
ellation  0.14
oord      0.14
ullo      0.14
grading   0.14
egasus    0.14
оÑĢом     0.14
acus      0.13
agem      0.13
Activations Density: 0.012%