One of the major pain points for me with Amazon Redshift has always been the coupling of storage and compute. Competitors like Snowflake and Google's BigQuery scale compute and storage independently, which makes for easier (and quicker) scaling in times of increased load. Redshift's main drawback on the scalability front has been that resizing a cluster can take up to 24 hours (during which it's in read-only mode), so there's a lot of pressure to get your cluster configuration spot on before you go into production. Redshift's elasticity just isn't up to par with most of Amazon's other services. While Redshift Spectrum helps, it's not a solution to scaling an existing cluster.
In the lead-up to re:Invent, Amazon last night dropped a load of really neat announcements (server-side encryption as standard for DynamoDB, SSE support for SNS), among which was the reveal of Elastic resize for Redshift. As an aside, if this is the stuff they're announcing now, there should be some really nice announcements at re:Invent itself.
Traditionally, when you identified a need to resize your Redshift data warehouse, you'd have to schedule a maintenance window to carry out the resize operation. This can take anything from 1 to 24 hours, depending on your node type, volume of data, and other factors.
Under the "classic" model, Redshift switches your cluster into read-only mode and takes a snapshot of your data. It then goes away and provisions an entirely new cluster that meets your new spec, and starts loading all your data in from the snapshot. Only once this load operation is complete does Redshift point your cluster endpoint over to the new cluster and release its read-only hold. The old cluster is then destroyed.
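For the record, a classic resize is what you get when you change the cluster spec through the standard ModifyCluster API. Here's a minimal boto3 sketch; the cluster identifier and target sizing are hypothetical:

```python
import boto3

redshift = boto3.client("redshift")

# Changing the node count (or node type) through ModifyCluster kicks off
# a classic resize: snapshot, provision a new cluster, reload, cut over.
redshift.modify_cluster(
    ClusterIdentifier="my-warehouse",  # hypothetical cluster name
    NodeType="dc2.large",
    NumberOfNodes=8,                   # hypothetical target size
)
```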
As you can imagine, this is a time-consuming and disruptive process. Do you really want your enterprise data warehouse to be unavailable for writes for up to a day? While there are workarounds, such as provisioning a new cluster yourself and running a pseudo-replication process, these are typically heavy on effort and cost.
As they often do, Amazon have recognised the pain point and worked to remedy it. Elastic resize massively improves the process by making it a mainly online operation, cutting the period of disruption from up to 24 hours down to only a few minutes.
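In practice, an elastic resize is driven through the newer ResizeCluster API rather than ModifyCluster. A minimal boto3 sketch, again with a hypothetical cluster identifier:

```python
import boto3

redshift = boto3.client("redshift")

# ResizeCluster performs an elastic resize by default; Classic=True
# would fall back to the old snapshot-and-reload behaviour.
redshift.resize_cluster(
    ClusterIdentifier="my-warehouse",  # hypothetical cluster name
    NumberOfNodes=8,                   # hypothetical target size
    Classic=False,
)
```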
If you want to understand how this works under the hood, I thoroughly recommend you watch Amazon’s online tech talk on the subject, which details how they’ve achieved Elastic resizing: https://pages.awscloud.com/Best-Practices-for-Scaling-Amazon-Redshift_1111-ABD_OD.html
At a high level, they've developed a mechanism whereby some of the slices of your cluster can be transferred to new nodes transparently, minimising disruption and allowing the cluster to remain read/write capable throughout the majority of the process. There may be some minor disruption, including query cancellations, but I'll take a few minutes over several hours any day.
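If you're curious to watch the redistribution happen, you can poll the resize status while it runs. A rough sketch using the DescribeResize API (cluster name hypothetical):

```python
import time

import boto3

redshift = boto3.client("redshift")

# Poll DescribeResize until the operation finishes. For an elastic
# resize, ResizeType reports "ElasticResize" and the whole operation
# should complete in minutes rather than hours.
while True:
    status = redshift.describe_resize(ClusterIdentifier="my-warehouse")
    print(status["ResizeType"], status["Status"])
    if status["Status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)
```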
There are some limitations, of course: at launch, an elastic resize can only double or halve your node count, it can't change the node type (that still needs a classic resize), and queries are briefly held or dropped while the final cutover happens.
All in all, the introduction of the Elastic resize capability is a major plus for Redshift. While it doesn't decouple storage from compute, it does remove a major barrier to cluster scaling, and even opens the door to scaling up and down with demand - a use case that has never really been practical on Redshift until now.
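To make that concrete, you could imagine a small scheduled job (a Lambda on a cron, say) that doubles the cluster ahead of the nightly batch and halves it again afterwards. Everything here - the function, cluster name, and node counts - is hypothetical:

```python
import boto3

redshift = boto3.client("redshift")

def scale_for_batch(scale_up: bool) -> None:
    """Elastically resize between a hypothetical quiet-hours size and a
    larger batch-window size. Note elastic resize only supports certain
    targets (e.g. doubling or halving the current node count)."""
    target_nodes = 8 if scale_up else 4  # hypothetical node counts
    redshift.resize_cluster(
        ClusterIdentifier="my-warehouse",  # hypothetical cluster name
        NumberOfNodes=target_nodes,
    )
```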
Has anyone tried out Elastic resize so far? If so, let me know what you think of the capability and how this has impacted your business.