INDEX
Explanations
phrases related to harm or injury
Negative Logits
ancel    -0.15
anten    -0.14
žen      -0.14
رد       -0.14
łu       -0.13
cess     -0.13
Pix      -0.13
važ      -0.13
which    -0.13
ダイ     -0.13
Positive Logits
والتي    0.17
）の     0.16
:;↵      0.15
）的     0.15
@js      0.15
lew      0.15
sino     0.15
，以及   0.15
])->     0.14
')['     0.14
Activations Density 1.257%