INDEX
Explanations
refusal to generate harmful content
Negative Logits
ابن    0.97
за    0.96
ishu    0.94
на    0.94
,    0.92
ณ    0.91
далее    0.89
далі    0.89
in    0.89
u    0.89
Positive Logits
mennesker    1.23
だったら    1.20
mDatas    1.18
resultContent    1.17
क्षर    1.16
gameField    1.15
multipart    1.14
ObjData    1.13
ষে    1.13
getSize    1.12
Activations Density 0.042%