INDEX
Explanations
mentions of testing environments or controlled settings
Negative Logits
pong       -0.07
erna       -0.07
minster    -0.07
OOT        -0.06
ạc         -0.06
oot        -0.06
fich       -0.06
طاÙĤ       -0.06
accom      -0.06
903        -0.06
Positive Logits
ayar       0.07
celik      0.07
itories    0.07
sandbox    0.06
aha        0.06
itorio     0.06
attery     0.06
ndl        0.06
dg         0.06
enler      0.06
Activations Density 0.000%