emremr

Often times the export/import activity may be limited on several performance bottlenecks. So, the activity may be faster if a distributed transfer is used instead of normal transfer. Some of the bottlenecks include Read Throughput , Write throughput , how the code parses the data(Inline or Batch etc,. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

http://sqoop.apache.org/

This guide shows you > To Install sqoop and > Export/Import MySQL tables (from S3 to RDS) ,(from RDS to S3) respectively.

Considering a sample MySQL Table in RDS

> mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p

mysql> describe dailyStockPrices;

+----------+--------------+------+-----+-------------------+-------+
| Field    | Type         | Null | Key | Default           | Extra |
+----------+--------------+------+-----+-------------------+-------+
| symbol   | varchar(10)  | YES  |     | NULL              |       |
| name     | varchar(100) | YES  |     | NULL              |       |
| ask      | varchar(10)  | YES  |     | NULL              |       |
| bid      | varchar(10)  | YES  |     | NULL              |       |
| daysLow  | varchar(10)  | YES  |     | NULL              |       |
| daysHigh | varchar(10)  | YES  |     | NULL              |       |
| ts       | timestamp    | YES  |     | CURRENT_TIMESTAMP |       |
+----------+--------------+------+-----+-------------------+-------+
7 rows in set (0.00 sec)

mysql> SELECT table_name AS “Table”, -> round(((data_length + index_length) / 1024 / 1024), 2) “Size in MB” -> FROM information_schema.TABLES -> WHERE table_schema = “pipetest” -> AND table_name = “dailyStockPrices”;

+--------------------+------------+
| Table            | Size in MB   |
+--------------------+------------+
| dailyStockPrices | 5373.00      |
+--------------------+------------+
1 row in set (0.00 sec)

My MySQL table dataset has lot of commas in the fields, so I choose TSV format instead of CSV to import/export. If I used CSV format, Sqoop will get confused parsing data.

Sqoop on EMR 4.4.0 + is pre-installed

Starting from EMR AMI version 4.4.0 , Sqoop 1.4.6 is available as sandbox. This can be installed by simply selecting this option while provisioning the EMR cluster. By default, Sqoop on EMR has a MariaDB and PostgresSQL driver installed. To install an alternate set of JDBC connectors for Sqoop, you need to install them in /usr/lib/sqoop/lib.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sandbox.html#emr-sqoop

Sqoop on EMR 3.x.x can be installed with the following script:

# Install Apache sqoop on Hadoop Cluster's Master node

#!/bin/bash #You may put this script in your s3 bucket: s3://<BUCKET_NAME>/install_sqoop.sh

This script works on all EMR 3.x.x AMI’s and can easily be extended for EMR 4.x.x AMI’s based on directory strutcture.

There is not need to run this script if EMR AMI is 4.4.0 or greater.

Download scoop from one of the mirrors.

#A sample tar for hadoop 2 is “sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz” from mirror :

http://apache.arvixe.com/sqoop/1.4.6/

And place this tarball in your s3 bucket

hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz .

wget http://apache.arvixe.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz cp sqoop-1.4.6.bin__hadoop-2.0.4-alpha /home/hadoop/

Download mysql connector library and copy this to your s3 bucket

A sample library can be found in mirror

http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz

hadoop fs -copyToLocal s3://< bucket_name >/mysql-connector-java-5.1.36.tar.gz .

wget http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz

tar -xzf mysql-connector-java-5.1.36.tar.gz cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/

Add symlink, so that sqoop can be directly invoked from shell instead of absolute path.

sudo ln -s /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop /usr/bin/sqoop

#end

Import Commands:

Import sqoop
This command copies MySQL table from RDS to S3. The S3 file content type is TSV & File name will be in "part-m-00000" format.
Note that with -m 1 , I am using single mapper task to run in parallel.
sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
Check S3 contents
hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
Sqoop command usage:
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Export commands :

Before export, The Destination MySQL/PSQL table should already be created with a similar schema.
Export to RDS(MySQL)
This command copies TSV file from S3 to MySQL Table.
sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
Export to Redshift(PSQL)
sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1
Export commands with mariadb connection string
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
Export with using Secure Socket Layer encryption
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
<fieldset>
	<legend>scratch</legend>
	<div><label for="f1">Use this area to play around with the commands</label><br/><textarea rows="5" cols="80" id="f1"></textarea></div>
</fieldset>