INDEX
Explanations
instances of refusal or denial in various contexts
Negative Logits
patch             -0.15
olle              -0.15
سلام              -0.14
.getOwnProperty   -0.14
Reserved          -0.14
385               -0.14
lesbienne         -0.14
alli              -0.14
abus              -0.14
つぶ              -0.14
Positive Logits
let               0.20
allow             0.18
accepting         0.17
letting           0.17
admit             0.17
accept            0.17
accepts           0.17
Accept            0.16
allowing          0.16
unless            0.15
Activations Density 0.065%