INDEX
Explanations
phrases that indicate substitution or replacement
Negative Logits
azon            -0.16
agrid           -0.15
ENCHMARK        -0.15
rna             -0.15
udad            -0.14
vla             -0.14
uren            -0.14
.SYSTEM         -0.14
TestCategory    -0.14
quam            -0.14
Positive Logits
substitute      0.26
replace         0.24
replace         0.24
substitution    0.23
replacing       0.23
replaces        0.22
substit         0.22
substitutes     0.20
Substitute      0.20
replacement     0.20
Activations Density: 0.065%