Explanations
legal and ethical constraints regarding actions and behaviors
Negative Logits
Token        Logit
ogen         -0.15
chos         -0.15
lett         -0.15
Intercept    -0.15
å¾Ĵ          -0.15
oggler       -0.14
alphabet     -0.14
iser         -0.14
geçir        -0.14
Sie          -0.14
Positive Logits
Token        Logit
illegal      0.39
illegal      0.35
against      0.31
Illegal      0.30
prohibited   0.29
frowned      0.29
Illegal      0.28
grounds      0.28
against      0.28
forbidden    0.26
Activations Density: 0.221%