Sharing Data with S3

By Stephen Denham
Wednesday, December 16, 2020

For better or for worse, S3 has become a common transport mechanism for sharing large datasets between partner organizations. This is particularly common in the “data lake”/“Hadoop ecosystem” world, with tools such as Athena, Redshift, Snowflake, Hive, Pig, EMR, Glue, and Lake Formation.

In cases where two organizations are both on AWS and want to share data, there are several common pitfalls. There is no single “best solution” – it’s highly contextual to your business – but we believe setting cross-account bucket policies has a number of key features that reduce management overhead and make for a smooth long-term integration.

However, this is a commonly misunderstood and often unintuitive setup for many engineers, even those with a strong background in AWS and IAM.


Let’s imagine Company A wants to get data from Company B. They want this data for their internal reporting and machine learning algorithms, and because both companies run on AWS, they’ve agreed that Company B will put data in Company A’s S3 bucket.

The Instinctive Approach

When presented with this requirement, the approach people usually reach for first is “we’ll create a user on our AWS account and give you access”.

Certainly this is an option, but it can lead to some surprises.

Company A goes ahead and creates a user and sends the IAM key pair to Company B. Already we have something less than ideal going on: security. Because no service role is involved, these secrets aren’t automatically rotated. They are long-lived credentials, likely forwarded through email between multiple stakeholders across both businesses. If either Company A or Company B has compliance requirements to rotate secrets, this is rarely done correctly once two companies are involved, and so we’ve created a compliance overhead.

But then we get to the crux of the issue. Company B is already using AWS for the process that writes the data. That process needs to read the data from somewhere – perhaps another of B’s S3 buckets, perhaps a Kinesis stream or a Kafka topic. Regardless, if B is running on AWS, there could easily be ten different reasons B needs permissions in both A’s and B’s accounts, and a given process can only use one set of AWS credentials at a time. Here lies the unfortunate truth.

More concretely, many readers will be familiar with the aws s3 cp s3://from-bucket/file s3://to-bucket/file command, which is perhaps the most common way to transfer data between two buckets. This command runs under a single set of AWS credentials.

Alternatively, you could perform all the operations you need with one account’s credentials, then switch accounts and perform the rest with the other’s. But if what we’re trying to do is transfer large amounts of data, this gets tricky: the data has to land somewhere in between, and in a world of Lambdas and containers, many companies don’t even have a process for managing EC2 instances with large volumes attached. Even more important, we’ve just doubled the amount of work and the room for error. If something breaks and needs to restart, you need to know where you left off – are you going to track that manually? You’ve just introduced several new problems and significantly increased your operational overhead.

Cross Account Access

But what about “switching roles” or “cross-account access”? Surely this is the solution to connecting to two AWS accounts with the same credentials? It does let you access both accounts – but not at the same time, so unfortunately a single s3 cp command still won’t work. I’m beginning to think this is one of those things people need to try for themselves to believe. I remember that when I initially came across this problem, I refused to accept that “cross-account access” didn’t meet my cross-account access needs.


So What’s the Solution?

Enter “cross-account bucket permissions”, which sounds painfully close to “cross-account access” but is something entirely different. It’s the approach recommended in the AWS Solutions Architect exam material.
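The heart of the setup is a bucket policy that Company A attaches to its own bucket, naming Company B’s AWS account as a principal that is allowed to write objects. As a minimal sketch – the account ID, bucket name, and prefix below are made up for illustration:

```python
import json

# Hypothetical values: Company A owns the bucket, and Company B's
# AWS account ID is 222233334444.
BUCKET = "company-a-data-lake"
COMPANY_B_ACCOUNT = "222233334444"

# Bucket policy Company A attaches to its bucket. It grants Company B's
# account permission to put objects under the "incoming/" prefix.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCompanyBWrites",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{COMPANY_B_ACCOUNT}:root"},
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/incoming/*",
        }
    ],
}

# The JSON document you would paste into the bucket's policy editor.
print(json.dumps(bucket_policy, indent=2))
```

Because the grant lives on the bucket itself, Company A never has to mint or rotate credentials for Company B – there is no shared key pair at all.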

The steps AWS describes are lengthy and a little opaque, but if you follow them to the letter, you end up with an extremely simple solution, and that technical credit pays off over time.
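The other half lives on Company B’s side: an identity policy on the role (or user) that runs the copy, allowing it to read B’s own source bucket and write into A’s bucket. With both halves in place, one set of credentials can do the whole transfer, so a single aws s3 cp works. Again a sketch with made-up bucket names:

```python
import json

# Hypothetical names: Company B's own export bucket, and Company A's
# destination bucket from the earlier bucket policy.
SOURCE_BUCKET = "company-b-exports"
DEST_BUCKET = "company-a-data-lake"

# Identity policy Company B attaches to its writer role. Combined with
# Company A's bucket policy, this single set of credentials can read
# the source and write the destination.
writer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{SOURCE_BUCKET}",
                f"arn:aws:s3:::{SOURCE_BUCKET}/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{DEST_BUCKET}/incoming/*",
        },
    ],
}

print(json.dumps(writer_policy, indent=2))
```

One real-world wrinkle worth checking in the AWS documentation: objects Company B writes into Company A’s bucket are, by default, owned by Company B’s account, so Company A will usually ask B to write with the bucket-owner-full-control canned ACL so that A can actually read what lands in its own bucket.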

Collaboration

What we’re talking about here is a solution for two companies running on AWS to leverage, and in our experience this is always a challenging process. When speaking with people on the other end of the phone, you often have no insight into their area of expertise or their position in their organization. And if a transfer solution is good for one party but bad for the other, both still have to live with that system – so taking the time to understand your partner’s context and their technical constraints will ultimately end in a win-win.

What’s next?

Did you find this content useful? Want to know more about district m? Please contact our staff at [email protected]. And, by the way, because our staff is now free of many repetitive tasks, they are available to give you immediate attention.


About the author

Stephen is a senior staff engineer at district m. He is passionate about bringing data products to life and making district m a product company.
