The tools that we’re going to be using today are DynamoDB, which is our primary data repository; Elastic MapReduce, or EMR, which in this case is running Hadoop; then Hive and HDFS; then Impala; and finally S3. These are all running on the Amazon cloud, or AWS. We’re not going to go into detail on the data itself, or the queries, because this is proprietary data from one of our customers, so parts of this will be blurred out. Instead we’re going to focus on the process of doing these kinds of ad hoc queries in a cloud environment, which frankly is where we do most of this kind of stuff.
EMR Hadoop Cluster
We’ll start by spinning up an EMR Hadoop cluster. This is a fairly simple cluster with one master and two core nodes. So we’ll give it a name. We don’t need termination protection or logging in this case; we’re just doing kind of a one-time thing. This is running Hadoop; we could be running MapR too. Hive and Pig are selected. I actually don’t need Pig, but it’s OK, we’ll just leave it there. We do need Impala, though. The rest of this is pretty much OK, and we’ll go ahead and go with these instance types; those are fine. We don’t need IAM users. In this case there are no bootstrap actions or steps that we need.
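The demo builds the cluster in the web console, but roughly the same choices could be scripted. The following is a minimal sketch, assuming the AWS CLI's `aws emr create-cluster` command is used instead of the console. The cluster name, AMI version, instance type, and key pair name are hypothetical placeholders, and account-specific settings such as IAM roles are left out. The function only assembles the argument list; it does not call AWS.

```python
from typing import List


def build_create_cluster_args(name: str, ami_version: str,
                              instance_type: str, key_name: str) -> List[str]:
    """Return an argument list for `aws emr create-cluster` (nothing is executed)."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        # Assumption: an AMI release that offers Impala as an application.
        "--ami-version", ami_version,
        # Pig is not needed for this job, but is left selected as in the demo.
        "--applications", "Name=Hive", "Name=Pig", "Name=Impala",
        # One master node and two core nodes.
        "--instance-groups",
        f"InstanceGroupType=MASTER,InstanceCount=1,InstanceType={instance_type}",
        f"InstanceGroupType=CORE,InstanceCount=2,InstanceType={instance_type}",
        # A key pair is needed to SSH into the master later.
        "--ec2-attributes", f"KeyName={key_name}",
        # One-time job, so no termination protection.
        "--no-termination-protected",
        # No --log-uri, --bootstrap-actions, or --steps: none are used here.
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(build_create_cluster_args(
        "adhoc-query", "3.3.1", "m3.xlarge", "my-key-pair")))
```

Keeping the arguments as a list makes it easy to review the cluster shape before handing it to something like `subprocess.run`.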
OK, so we’ll go ahead and create that cluster. OK, so now that’s running and we’ve SSHed into the master to run our commands. We’ll start by using Hive to extract the data that we’re interested in from DynamoDB. So we’ll set up a little bit of the environment here, then we’ll create a table in HDFS to put our data in, and then we’ll do our import. This is often where a lot of the processing time goes. Just make a little adjustment here. We’re not doing the really complex query here. What we’re doing is just extracting a subset of the data, to reduce the domain that we’re going to look at for our real detailed queries. This will take a while.
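Since the actual statements are blurred out, here is a sketch of what a Hive import of this kind can look like, assuming the EMR DynamoDB storage handler for Hive is used. The table names `ddb_source` and `hdfs_subset`, the columns, the attribute mapping, the filter, and the read-throughput value are all hypothetical. The Python function just assembles the HiveQL text that would be typed at the Hive prompt.

```python
from typing import Dict


def hive_import_script(ddb_table: str, columns: Dict[str, str],
                       mapping: Dict[str, str], where: str) -> str:
    """Build HiveQL that copies a filtered subset of a DynamoDB table into HDFS."""
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    col_map = ",".join(f"{col}:{attr}" for col, attr in mapping.items())
    col_list = ", ".join(columns)
    return "\n".join([
        "-- Environment: how much of the provisioned read throughput the import may use.",
        "SET dynamodb.throughput.read.percent=1.0;",
        "",
        "-- External table that maps onto the DynamoDB table (no data is copied yet).",
        f"CREATE EXTERNAL TABLE ddb_source ({col_defs})",
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'",
        f'TBLPROPERTIES ("dynamodb.table.name" = "{ddb_table}",',
        f'               "dynamodb.column.mapping" = "{col_map}");',
        "",
        "-- Native Hive table, stored in HDFS on the cluster.",
        f"CREATE TABLE hdfs_subset ({col_defs});",
        "",
        "-- The import: a simple filter, not the detailed analysis.",
        "INSERT OVERWRITE TABLE hdfs_subset",
        f"SELECT {col_list} FROM ddb_source WHERE {where};",
    ])


if __name__ == "__main__":
    # Hypothetical table, columns, and filter, for illustration only.
    print(hive_import_script(
        "CustomerEvents",
        {"user_id": "string", "event_ts": "bigint"},
        {"user_id": "UserId", "event_ts": "EventTs"},
        "event_ts >= 1400000000",
    ))
```

The filter in the `WHERE` clause is what keeps the later queries small: only the matching rows are scanned out of DynamoDB and written to HDFS.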
Now we’ve finished pulling the data from DynamoDB into HDFS, and we’re going to move to Impala. So we’ll quit out of Hive and get into Impala. Impala is really a better tool for this kind of querying; in fact, that’s really what Impala was built for. Now we’re going to define the tables for both the input data and the results. We need to tell Impala about the table we just created. And this one is where the results are going to go. In the end we’re going to put this in an S3 bucket so that we can go in and grab it when it’s ready and download it. By that time this will be really crunched down, so it’ll be a fairly small file to download. Now we’re going to do the main query. This is really where we extract the detail and analyze the data. OK, so that’s done.
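The Impala step can be sketched along the same lines. This is one possible shape, assuming Impala shares the Hive metastore, so that `INVALIDATE METADATA` is enough to make the Hive-created table visible. The real query is proprietary, so a simple count per group stands in for it, and the table and column names are hypothetical.

```python
def impala_query_script(source_table: str, results_table: str,
                        group_col: str, group_type: str) -> str:
    """Build Impala SQL: pick up the Hive table, create a results table, fill it."""
    return "\n".join([
        "-- Reload metastore metadata so the table created in Hive becomes visible.",
        "INVALIDATE METADATA;",
        "",
        "-- Results table, tab-delimited text so the output is easy to export later.",
        f"CREATE TABLE {results_table} ({group_col} {group_type}, event_count BIGINT)",
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';",
        "",
        "-- Main query: a stand-in aggregation for the real analysis.",
        f"INSERT OVERWRITE TABLE {results_table}",
        f"SELECT {group_col}, COUNT(*) FROM {source_table} GROUP BY {group_col};",
    ])


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(impala_query_script("hdfs_subset", "results", "user_id", "STRING"))
```

Because the aggregation collapses many rows into one per group, the results table ends up much smaller than the imported subset, which matches the point about the download being small.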
Now we’re going to get out of Impala and move back into Hive, and again we’ll set up some environment parameters. Now we’re going to set up a table in S3 to export the data to, and finally we’re going to export the data. This takes a minute or two; it’s not too bad. OK, and there we go. So we’ll go ahead and quit out of Hive. Oops, need a semicolon. And we’ll exit SSH. OK, so now we have our results data sitting in a bucket in S3, and this is in tab-delimited format, so we can just go ahead and pull that down and import it into Excel, or what have you. At this point it’s pretty amenable to a lot of different things, so we can go ahead and download that. We’ll just save that data, and that’s it.
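As a rough sketch of the export step, assuming the results are written through a Hive external table whose location is an S3 path: the table name `s3_export`, the columns, and the bucket path below are hypothetical, and the environment settings mentioned in the demo are omitted because they are not shown. A small helper for reading the downloaded tab-delimited file is included as well.

```python
import csv
import io
from typing import Dict, List


def hive_export_script(results_table: str, columns: Dict[str, str],
                       s3_location: str) -> str:
    """Build HiveQL that copies a results table into tab-delimited files on S3."""
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return "\n".join([
        "-- External table whose files live under the S3 location.",
        f"CREATE EXTERNAL TABLE s3_export ({col_defs})",
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'",
        f"LOCATION '{s3_location}';",
        "",
        "-- The export: inserting into the external table writes files to S3.",
        "INSERT OVERWRITE TABLE s3_export",
        f"SELECT * FROM {results_table};",
    ])


def parse_tsv(text: str) -> List[List[str]]:
    """Split downloaded tab-delimited text into rows of string fields."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


if __name__ == "__main__":
    # Hypothetical names and bucket, for illustration only.
    print(hive_export_script(
        "results",
        {"user_id": "string", "event_count": "bigint"},
        "s3://example-bucket/adhoc-results/",
    ))
    print(parse_tsv("alice\t3\nbob\t5\n"))
```

Once downloaded, the same file opens directly in Excel or any other tool that accepts tab-delimited text.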