Explanations
phrases indicating substitution or replacement of concepts or items
Negative Logits
azon        -0.17
vla         -0.16
ENCHMARK    -0.15
Rew         -0.14
agle        -0.14
нÑĮ         -0.14
agrid       -0.14
ovu         -0.14
uren        -0.14
Bever       -0.14
Positive Logits
substitute      0.27
replace         0.27
replace         0.26
replacing       0.24
replaces        0.23
substit         0.22
replacement     0.22
Replace         0.21
substitution    0.21
replaced        0.20
Activations Density: 0.097%