Optimizing Costs in Synapse Analytics

[Header image: Copilot's take on "Azure Synapse costs going down."  I'm particularly fond of the floating broccoli.  Also, what is Azull Kolgee, and are those even numbers?]

I’ve been a bit off the grid. 

This past January, I transitioned into a new job as a BI Developer at a large multinational manufacturing firm.  My previous four years largely centered on front-end BI development: building dashboards, reports, and data models, and interfacing with business users.  My new role is giving me the opportunity to dive headfirst into Azure Synapse Analytics and learn the ins and outs of data engineering in a modern data lakehouse.

The first project I’ve been working on with a coworker is optimizing costs in our Synapse environment.  Prior to my joining, my coworker had already optimized storage costs.  Our Synapse environment uses Delta Lake on Azure Data Lake Storage Gen2 to store data in parquet format across various layers (delta and curated) and serves the data out to end users and reporting tools through serverless SQL pool.  Because each nightly load generates new copies of delta lake data, storage costs were becoming unwieldy.  A weekly OPTIMIZE and VACUUM of the delta layer has greatly reduced storage costs.
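For anyone curious what that maintenance looks like in practice, here's a minimal sketch of the weekly job as a Synapse notebook cell.  The table path and the 7-day retention window are placeholders, not our actual values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table path; ours differ per layer and table.
table_path = "abfss://delta@<storageaccount>.dfs.core.windows.net/sales"

# Compact many small parquet files into larger ones.
spark.sql(f"OPTIMIZE delta.`{table_path}`")

# Remove data files no longer referenced by the current table version and
# older than the retention window (168 hours = 7 days, an assumed value).
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```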

My first project

We turned our attention to the next step, compute costs, shortly after I joined.  Compute was running anywhere from $80-100 per day, and we needed better visibility into which “problem children” were driving these costs.  We implemented pipeline logging on each nightly job, writing the activity metadata and start/end times for each table out to a custom log table in a database.  We performed cost analysis on that metadata and created several Power BI dashboards to track ongoing run times by table and activity.  Since then, we’ve begun turning knobs and pulling levers, tracking each change, observing outcomes, and forming conclusions about best practices for data movement in our environment.
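As a rough illustration of what we capture, here's a hedged sketch of a single log entry.  Our real logging is wired up in the pipeline itself, so the table and column names below are hypothetical rather than our actual schema.

```python
from datetime import datetime, timezone
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

start_time = datetime.now(timezone.utc)
# ... the actual load/MERGE work for one table would run here ...
end_time = datetime.now(timezone.utc)

log_row = Row(
    pipeline_name="pl_nightly_load",     # assumption
    activity_name="nb_merge_sales",      # assumption
    table_name="curated.sales",          # assumption
    start_time=start_time,
    end_time=end_time,
    duration_s=(end_time - start_time).total_seconds(),
    status="Succeeded",
)

(spark.createDataFrame([log_row])
      .write.format("delta")
      .mode("append")
      .saveAsTable("logging.pipeline_activity_log"))  # hypothetical log table
```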

The following are three of the conclusions I’ve drawn over the past six weeks on this project.  We’ve only scratched the surface of potential optimizations but have already seen some significant improvements (and some not-so-helpful changes) from a few adjustments.

1. The vast majority of compute costs in Synapse come from running notebooks and data flows.

This should have been obvious to me from the start, but I was still shocked as I looked deeper.  Pipeline runs are pennies on the dollar.  In our environment, activity runs cost $1 per 1,000 runs, and data movement is 25 cents per DIU-hour (I won’t go into DIUs here, but for us they were minimal).  On most nights, pipeline and copy data activities cost us less than $1 per day.

Notebook and data flow activities, on the other hand, are billed on vCores, a separate pricing model.  Depending on the size of the Spark pool (which determines the number of vCores allocated) and the efficiency of the notebook, costs can expand quickly as more data is processed.  We found that at least 90% of our compute costs were coming from notebook activities.
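To put rough numbers on it, here's a back-of-the-envelope comparison using the rates above.  The vCore-hour rate, activity counts, pool size, and runtime are illustrative assumptions; actual vCore pricing varies by region.

```python
# Back-of-the-envelope cost comparison using the rates mentioned above.
# The vCore-hour rate, run counts, pool size, and runtime are assumptions for
# illustration only; check the Azure pricing page for your region.

activity_runs = 800                    # assumed nightly orchestration activity runs
diu_hours = 0.5                        # assumed copy-activity data movement
pipeline_cost = activity_runs / 1000 * 1.00 + diu_hours * 0.25

vcores = 8 * 3                         # assumed: 3 Medium (8-vCore) nodes
hours = 2.5                            # assumed notebook runtime
vcore_hour_rate = 0.15                 # assumed $/vCore-hour; varies by region
notebook_cost = vcores * hours * vcore_hour_rate

print(f"Pipeline + copy: ${pipeline_cost:.2f}")   # ~ $0.93
print(f"One notebook:    ${notebook_cost:.2f}")   # ~ $9.00
```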

2. ZORDER is expensive to run and of minimal help (so far).

ZORDER is an optimization technique for delta tables that can help performance by logically sorting the data in the underlying parquet files.  It works something like a clustered index in SQL Server.  You choose one or more columns to ZORDER by when running an OPTIMIZE command, and the output parquet files are compacted into files of roughly 1 GB each, ordered by the column(s) you specify.  This is supposed to help when you run a MERGE statement, as the engine can be more intelligent about where the updated source data fits into the target files, since they’re efficiently sorted.

We ran ZORDER on a hash column of our tables’ primary keys, which for a larger table (600 million records) took a fair amount of time to execute.  We expected performance to improve the next night but found minimal if any improvement, and we saw the same with smaller tables.  I suspect this is because the hash randomizes values, so even though the files are sorted, the engine still has to search through many parquet files to find the corresponding rows.  ZORDER on something like a date column could be more effective, but we haven’t run further tests on this.
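For reference, the command itself is a one-liner.  The table and column names below are hypothetical stand-ins for our hash-of-primary-key column and the untested date alternative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What we ran: ZORDER on a hash of the primary key (hypothetical names).
spark.sql("OPTIMIZE curated.sales ZORDER BY (pk_hash)")

# An untested alternative: ZORDER on a column with natural ordering, like a date.
# spark.sql("OPTIMIZE curated.sales ZORDER BY (order_date)")
```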

3. Low-shuffle merge is very effective for large tables (and not so much for smaller ones).

This feature is enabled by default in Spark 3.3 and can be turned on optionally in Spark 3.2.  Low-shuffle merge performs one additional table scan when running a MERGE so it can exclude unmodified rows from the expensive shuffle.  Previously, the engine rewrote every delta parquet file containing data to insert or update, which dragged along rows that were unmodified.  Those unmodified rows are now left out of the shuffle and handled in a separate, more efficient write operation.
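Enabling it on the 3.2 runtime is a single Spark config.  If I’m remembering the Synapse docs correctly the flag is spark.microsoft.delta.merge.lowShuffle.enabled, but double-check that against the runtime documentation; the table names in the MERGE below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On the Spark 3.2 runtime, low-shuffle merge has to be switched on explicitly.
# Config key written from memory of the Synapse docs -- verify before relying on it.
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")

# The MERGE itself is unchanged; table names here are hypothetical.
spark.sql("""
    MERGE INTO curated.sales AS target
    USING staging.sales_updates AS source
        ON target.pk_hash = source.pk_hash
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```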

The performance improvements were drastic.  A notebook that used to take 2 hours and 30 mins to run a MERGE on 600 million records now takes anywhere from 30-60 mins, a 2.5-5x improvement in time and cost.  Interestingly, smaller tables (<1 million rows) appear to be adversely affected by the additional scan; some increased load times from 2 mins to 5-7 mins each.  Our next step is to upgrade our Spark pools to version 3.3 and disable low-shuffle merge for any tables that don’t meet the 1 million row count threshold.
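Here's a sketch of how that per-table toggle could look.  The threshold check and table names are assumptions, and counting rows has its own cost, so in practice we may source row counts from our log table instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ROW_THRESHOLD = 1_000_000  # the cut-off mentioned above

def set_low_shuffle(table_name: str) -> None:
    """Enable low-shuffle merge only for tables at or above the row threshold."""
    # count() scans the table; a cheaper source of row counts would be our log table.
    row_count = spark.table(table_name).count()
    enabled = "true" if row_count >= ROW_THRESHOLD else "false"
    spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", enabled)

set_low_shuffle("curated.sales")   # hypothetical table; its MERGE would run next
```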

Ongoing

It’s clear to me that this optimization will be an ongoing process, and there are many more topics to explore.  Future plans include partitioning, using overwrite instead of merge for smaller tables, adding more executors to the Spark pool, and smarter load patterns for specific tables.  We also intend to analyze serverless SQL pool query costs to see whether there are savings to be had there.  I’ll likely add more posts on this topic as we progress.

What optimization techniques have you found to be helpful in Synapse?  What other levers should we consider pulling?

