Often times the export/import activity may be limited on several performance bottlenecks. So, the activity may be faster if a distributed transfer is used instead of normal transfer. Some of the bottlenecks include Read Throughput , Write throughput , how the code parses the data(Inline or Batch etc,. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Considering a sample MySQL Table in RDS
> mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p
mysql> describe dailyStockPrices;
+----------+--------------+------+-----+-------------------+-------+ | Field | Type | Null | Key | Default | Extra | +----------+--------------+------+-----+-------------------+-------+ | symbol | varchar(10) | YES | | NULL | | | name | varchar(100) | YES | | NULL | | | ask | varchar(10) | YES | | NULL | | | bid | varchar(10) | YES | | NULL | | | daysLow | varchar(10) | YES | | NULL | | | daysHigh | varchar(10) | YES | | NULL | | | ts | timestamp | YES | | CURRENT_TIMESTAMP | | +----------+--------------+------+-----+-------------------+-------+ 7 rows in set (0.00 sec)
mysql> SELECT table_name AS “Table”,
-> round(((data_length + index_length) / 1024 / 1024), 2) “Size in MB”
-> FROM information_schema.TABLES
-> WHERE table_schema = “pipetest”
-> AND table_name = “dailyStockPrices”;
+--------------------+------------+ | Table | Size in MB | +--------------------+------------+ | dailyStockPrices | 5373.00 | +--------------------+------------+ 1 row in set (0.00 sec)
My MySQL table dataset has lot of commas in the fields, so I choose TSV format instead of CSV to import/export. If I used CSV format, Sqoop will get confused parsing data.
Sqoop on EMR 4.4.0 + is pre-installed
Starting from EMR AMI version 4.4.0 ,Sqoop 1.4.6 is available as sandbox. This can be installed by simply selecting this option while provisioning the EMR cluster. By default, Sqoop on EMR has a MariaDB and PostgresSQL driver installed. To install an alternate set of JDBC connectors for Sqoop, you need to install them in /usr/lib/sqoop/lib.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sandbox.html#emr-sqoop
Sqoop on EMR 3.x.x can be installed with the following script:
# Install Apache sqoop on Hadoop Cluster's Master node#!/bin/bash #You may put this script in your s3 bucket: s3://<BUCKET_NAME>/install_sqoop.sh
This script works on all EMR 3.x.x AMI’s and can easily be extended for EMR 4.x.x AMI’s based on directory strutcture.
There is not need to run this script if EMR AMI is 4.4.0 or greater.
Download scoop from one of the mirrors.
#A sample tar for hadoop 2 is “sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz” from mirror :
http://apache.arvixe.com/sqoop/1.4.6/
And place this tarball in your s3 bucket
hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz .
wget http://apache.arvixe.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz cp sqoop-1.4.6.bin__hadoop-2.0.4-alpha /home/hadoop/
Download mysql connector library and copy this to your s3 bucket
A sample library can be found in mirror
http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
hadoop fs -copyToLocal s3://< bucket_name >/mysql-connector-java-5.1.36.tar.gz .
wget http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
tar -xzf mysql-connector-java-5.1.36.tar.gz cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/
Add symlink, so that sqoop can be directly invoked from shell instead of absolute path.
sudo ln -s /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop /usr/bin/sqoop
#end
Import Commands:
- Import sqoop
- This command copies MySQL table from RDS to S3. The S3 file content type is TSV & File name will be in "part-m-00000" format.
- Note that with -m 1 , I am using single mapper task to run in parallel.
sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'- Check S3 contents
hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000- Sqoop command usage:
- http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Export commands :
- Export to RDS(MySQL)
- This command copies TSV file from S3 to MySQL Table.
sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000- Export to Redshift(PSQL)
sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1- Export commands with mariadb connection string
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1- Export with using Secure Socket Layer encryption
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
