AWS is knocking it out of the park at the moment, with loads of new services and features coming out every week. Indeed, it can be hard to keep up with the pace of change. But while working on one of our Redshift clusters today, we spotted a potential scoop that would remove a key blocker for one extremely useful service: Redshift Spectrum.
Up until now it’s only been possible to use Spectrum if you don’t have Enhanced VPC Routing enabled on your Redshift cluster. There are so many benefits to using Enhanced VPC Routing (reduced data transfer cost, control, security) that it’s hard to see why anyone wouldn’t be using it, especially if you move data between Redshift and S3 a lot.
But we spotted a new parameter being applied to one of our clusters when we made some maintenance changes to a parameter group. There’s now a parameter named spectrum_enable_enhanced_vpc_routing showing, which hints that Amazon may be about to remove this crucial limitation.
Redshift Spectrum is a seriously cool name for what is essentially fluid extra horsepower for your Redshift cluster. One of the things commonly cited as a drawback for Redshift is the fact that storage is coupled with compute: there’s no way to scale up to more computing power without also scaling storage (and paying for it). Enter Spectrum.
Redshift Spectrum is an extension to Redshift that allows AWS users to use on-demand Redshift capability to instantly scale compute power in order to query data that is held in S3. This works by defining external tables in Redshift. These external tables are essentially metadata telling Redshift that the files in a specific S3 location are structured in a particular way, so that when a user issues a query against the external table, the Redshift query optimiser knows what the data is, and what it looks like.
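To make the external table idea a bit more concrete, here is a minimal sketch in Python that builds the kind of DDL statement you would run against Redshift. The helper name external_table_ddl, the schema spectrum_demo, the table page_views, its columns and the bucket path are all made up for illustration. It also assumes the external schema has already been created with CREATE EXTERNAL SCHEMA and that the files are Parquet.

```python
def external_table_ddl(schema, table, columns, s3_location, file_format="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for Redshift Spectrum.

    `columns` is a list of (name, sql_type) pairs describing how the
    files under `s3_location` are structured.
    """
    column_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"    {column_sql}\n"
        f")\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical schema, table, columns and bucket, purely for illustration.
ddl = external_table_ddl(
    schema="spectrum_demo",
    table="page_views",
    columns=[
        ("view_time", "timestamp"),
        ("user_id", "bigint"),
        ("url", "varchar(2048)"),
    ],
    s3_location="s3://example-bucket/page_views/",
)
print(ddl)
```

Nothing here moves any data: the statement only records where the files live and what shape they have. The data stays in S3 and is read when someone queries spectrum_demo.page_views.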
When you query this external table, Redshift estimates the data volume and the computing power needed, then allocates compute resources from a central pool to service your query. This all happens transparently, and ensures that you are temporarily allocated enough compute power to process your query in a reasonable timeframe.
Crucially, this answers the compute vs storage complaint and gives Redshift a similar capability to Google’s BigQuery, which had previously been missing.
I’ll delve into Spectrum in more detail in another post, but for now let’s get back to the matter at hand. In the meantime, why not check out Amazon’s docs on Redshift Spectrum?
In AWS you can configure VPCs (Virtual Private Clouds), which allow you to segregate and group resources and control security, data transfer, and all sorts of other things for all manner of reasons. Crucially though, some centralised AWS services live outside your VPCs, most importantly S3 (Simple Storage Service), which is the backbone of AWS. Amazon don’t charge you to put data into AWS (why would they?) but they do charge you to take data out, or to move it around between regions and VPCs. Because S3 sits outside your VPC, it also means that traffic between your VPC and S3 has to go over the big bad Internet.
So this becomes important when you have data moving between “VPC-less” (at least in basic terms) services such as S3 and the resources that you’ve configured within a VPC, for example Redshift. Fortunately, AWS offers Enhanced VPC Routing, which allows you to route traffic between S3 and Redshift through your VPC, meaning you can control all kinds of aspects of this data movement such as DNS, security groups, ACLs, traffic monitoring and loads more. The advantages are obvious.
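If you’re not sure which of your clusters have Enhanced VPC Routing switched on, the Redshift API reports it per cluster. The sketch below assumes boto3 and AWS credentials are set up for the fetch step. The helper names and the sample cluster names are invented for illustration, and the filtering itself works on any response shaped like the output of describe_clusters.

```python
def clusters_without_enhanced_routing(describe_clusters_response):
    """Return identifiers of clusters where EnhancedVpcRouting is off."""
    return [
        cluster["ClusterIdentifier"]
        for cluster in describe_clusters_response.get("Clusters", [])
        if not cluster.get("EnhancedVpcRouting", False)
    ]


def fetch_clusters():
    """Fetch one page of cluster descriptions (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.client("redshift").describe_clusters()


# Hand-written sample in the shape of a describe_clusters response;
# the cluster names are hypothetical.
sample = {
    "Clusters": [
        {"ClusterIdentifier": "analytics", "EnhancedVpcRouting": True},
        {"ClusterIdentifier": "sandbox", "EnhancedVpcRouting": False},
    ]
}
print(clusters_without_enhanced_routing(sample))
```

Against a real account you would pass fetch_clusters() instead of the sample. Note that describe_clusters is paginated, so an account with a lot of clusters would need to follow the Marker value as well.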
Again, I may touch on this in another post so I’ll leave it here for now. In the meantime, see Amazon’s docs on Enhanced VPC Routing and Redshift.
Tucked away in the Spectrum small print is a line that states “Your cluster can’t have Enhanced VPC Routing enabled.” This is a major blocker for anyone wanting to use Spectrum with an in-VPC Redshift cluster, as it means either standing up a new cluster or turning off Enhanced VPC Routing.
Fortunately, the newly appeared spectrum_enable_enhanced_vpc_routing parameter suggests that this may be about to change. I’ve not seen anything from Amazon yet to confirm this, but watch this space!
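If you want to check whether the parameter has turned up in your own parameter groups, something like the following should do it. This is a rough sketch: the helper names are mine, the fetch step assumes boto3 and AWS credentials, and the sample response (including the value "false") is invented purely to show the shape of the data, not what AWS actually returns for this parameter.

```python
PARAM_NAME = "spectrum_enable_enhanced_vpc_routing"


def find_parameter(describe_parameters_response, name=PARAM_NAME):
    """Return the parameter entry with the given name, or None if absent."""
    for parameter in describe_parameters_response.get("Parameters", []):
        if parameter.get("ParameterName") == name:
            return parameter
    return None


def fetch_parameters(parameter_group_name):
    """Fetch a parameter group's parameters (needs boto3 and credentials)."""
    import boto3  # imported lazily so find_parameter works without boto3

    client = boto3.client("redshift")
    return client.describe_cluster_parameters(ParameterGroupName=parameter_group_name)


# Hand-written sample in the shape of a describe_cluster_parameters
# response; the values are made up for illustration.
sample = {
    "Parameters": [
        {"ParameterName": "enable_user_activity_logging", "ParameterValue": "true"},
        {"ParameterName": PARAM_NAME, "ParameterValue": "false"},
    ]
}
print(find_parameter(sample))
```

To run it for real, pass fetch_parameters("your-parameter-group") in place of the sample, where the group name is whatever your cluster uses.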
Let me know in the comments below if you’ve seen any more on the topic, or any official comms from AWS.