In this blog post, I will show you how to "ETL" all kinds of data to Amazon's cloud data warehouse Redshift with Talend's big data components. Let's begin with a short introduction to Amazon Redshift (quoted from the Amazon website):
"Amazon Redshift is [part of Amazon Web Services (AWS) and] a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year.
Traditional data warehouses require significant time and resources to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly."
Sounds interesting! And indeed, we already see companies using Talend's Redshift connectors. From a Talend perspective, Redshift is not much more than just another database. If you have ever used a Talend database connector, you can integrate with Redshift within minutes. In the next sections, I will describe all the necessary steps and give some hints regarding configuration issues and performance improvements.
Be aware: You need Talend Open Studio for Data Integration (open source) or any Talend Enterprise Edition / Platform which contains the cloud components to see and use the Amazon Redshift connectors. The open source edition offers all connectors and functionality needed to integrate with Amazon Redshift. However, the Enterprise versions offer additional features (e.g. versioning), more comfort (e.g. wizards), and commercial support.
Setup Amazon Redshift
Setting up Amazon Redshift is very easy. Just follow Amazon's getting started guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html. Like every other AWS guide, it is very easy to understand and use.
Be aware that you only need to complete steps 1, 2, and 3 of the getting started guide to use Redshift with Talend. Some hints:
– Step 1 ("before you begin"): Just sign up. Client tools and drivers are not necessary because they are already installed within Talend Studio.
– Step 2 ("launch a cluster"): Yes, please start your cluster!
– Step 3 ("authorize access"): If you are not sure what to do here, select Connection Type = CIDR/IP. Find out your IP address (http://whatismyipaddress.com) and enter it with "/32" appended. Example: "192.168.1.1/32"
Now you can connect to Amazon Redshift from your Talend Studio on your local computer. Step 4 (connect) and step 5 (create table, data, queries) are not necessary; this will be done from Talend Studio.
Of course, you should not forget to delete your cluster (step 7) when you are done. Otherwise, you will pay for every hour, even if you do not access your DWH.
Connect to Amazon Redshift from Talend Studio
Create a new connection to the Amazon Redshift database as you would with any other relational database. The easiest way is to use the "DB Connection Wizard" in the metadata section. Just enter your connection information and check whether it works. You can find all the required configuration details in the AWS Management Console. The connection string looks something like this:
"jdbc:paraccel://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev"
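If you want to test connectivity outside of Talend Studio, a few lines of plain Java are enough. This is just a minimal sketch, not part of the original setup: it assumes the PostgreSQL JDBC driver is on the classpath (Redshift speaks the PostgreSQL wire protocol, so the URL then starts with "jdbc:postgresql://" instead of "jdbc:paraccel://"), and "awsuser" / "password" are placeholders for your master user credentials:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class RedshiftConnectionTest {
        public static void main(String[] args) throws Exception {
            // Host, port, and database are the values shown in the AWS Management Console
            String url = "jdbc:postgresql://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev";
            // Placeholder credentials -- replace with your cluster's master user
            try (Connection conn = DriverManager.getConnection(url, "awsuser", "password")) {
                System.out.println("Connected to: " + conn.getMetaData().getDatabaseProductName());
            }
        }
    }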
Next, right-click on the created connection and select "retrieve schema". "public" is the default schema, which you (have to) use. Now you are ready to use this connection within Talend jobs to write to Amazon Redshift and read from it.
Create Talend Jobs (Write, Read, Delete)
Amazon Redshift components work like any other Talend (relational) database components. Look at www.help.talend.com for more information if you have not used them before (or just try them out, they are very self-explanatory).
You just have to drag & drop your connection from the metadata section. Afterwards, you can easily write data (tRedshiftOutput), read data (tRedshiftInput), or execute any other queries such as delete or copy (tRedshiftRow). In the following job, I start by deleting all content in the Amazon Redshift table. Then, I read data from a MySQL table and insert it into an Amazon Redshift table. The table is created automatically (as I have configured it this way). After this subjob is finished, I read the data again and store it in a CSV file (which is also created automatically). Of course, this is not a real business use case, but it shows how to use the different Amazon Redshift components.
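Under the hood, the code Talend generates for such a job boils down to standard JDBC calls. The following rough sketch shows the same flow in plain Java; the "customers" table with (id, name) columns and all connection details are made-up placeholders for illustration:

    import java.sql.*;

    public class MySqlToRedshiftJob {
        public static void main(String[] args) throws Exception {
            try (Connection mysql = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/shop", "user", "pass");
                 Connection redshift = DriverManager.getConnection(
                     "jdbc:postgresql://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev",
                     "awsuser", "password")) {

                // Subjob 1: delete all existing content (what tRedshiftRow does here)
                try (Statement del = redshift.createStatement()) {
                    del.executeUpdate("DELETE FROM customers");
                }

                // Subjob 2: read from MySQL and insert into Redshift
                // (tMysqlInput -> tRedshiftOutput)
                try (Statement read = mysql.createStatement();
                     ResultSet rs = read.executeQuery("SELECT id, name FROM customers");
                     PreparedStatement ins = redshift.prepareStatement(
                         "INSERT INTO customers (id, name) VALUES (?, ?)")) {
                    while (rs.next()) {
                        ins.setInt(1, rs.getInt("id"));
                        ins.setString(2, rs.getString("name"));
                        ins.executeUpdate(); // row-by-row -- slow, see the performance section below
                    }
                }
            }
        }
    }

Row-by-row inserts like this are exactly why the performance hints below matter.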
Query Data from Amazon Redshift
You can connect to Amazon Redshift directly from Talend Studio to explore and query the data in the DWH. Thus, no other database tool is required. Just right-click on your Amazon Redshift connection in the metadata section and select "edit queries". Here you can define, execute, and save SQL queries.
Improve Performance
Write performance of Amazon Redshift is relatively low compared to "classical" relational databases (in your own data center) because you have to upload all data into the cloud. Several alternatives exist to improve performance:
– Bulk inserts: "Extended insert" (in the advanced settings) improves performance a lot, but still not to hyperspeed… Also, as it is a bulk operation, you can only do inserts! It is not compatible with "rejects" or "updates".
– AWS S3 and COPY command: S3 is Amazon's "simple storage service", a key-value store – also called NoSQL today – for storing very large objects. You can use Amazon Redshift's COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) to transfer data from S3 to Amazon Redshift with good performance. However, you still have to copy the data to S3 first, so the same "cloud problem" applies. The COPY command can be used with tRedshiftRow (see the sketch after this list), so no problem at all from a Talend perspective. To transfer data to S3, you can either use the Talend S3 components from Talendforge, Talend's open source community (http://www.talendforge.org/exchange), or use camel-s3, an Apache Camel component which is included in Talend ESB. The latter is an option if you use Talend Data Services, which combines Talend DI and Talend ESB in its unified platform.
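As a rough sketch of the S3 + COPY approach: once the file has been uploaded to S3, the following statement could be executed from a tRedshiftRow component or, as shown here, from any JDBC client. The bucket name, file name, and AWS keys are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RedshiftCopyFromS3 {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev";
            try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
                 Statement stmt = conn.createStatement()) {
                // COPY loads the file directly (and in parallel) from S3 into the
                // target table -- much faster than row-by-row INSERTs.
                // "my-bucket", "customers.csv", and the AWS keys are placeholders.
                stmt.execute(
                    "COPY customers FROM 's3://my-bucket/customers.csv' "
                  + "CREDENTIALS 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>' "
                  + "CSV");
            }
        }
    }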
Summary
You do not need to be a cloud or DWH expert, or an expert developer, to integrate with Amazon's cloud data warehouse Redshift. It is very easy with Talend's integration solutions. Just drag & drop, configure, do some graphical mappings / transformations (if necessary), and that's it. Code is generated. The job runs. You can integrate Amazon Redshift almost as simply as any other relational database. Just be aware of some cloud-specific security and performance issues. With Talend, you can easily "ETL" all data from different sources to Redshift and store it there for under $1,000 per terabyte per year – even with the open source version!
Best regards,
Kai Wähner
(Contact and feedback via @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing)
4 comments
Regarding performance questions:
With Talend 5.4 (to be released in December 2014), Talend will offer official Amazon AWS S3 components within Talend Studio, so there will no longer be any need to use the alternatives mentioned above.
How do I schedule ETL jobs with Talend Open Studio? My data source is SQL Server and my destination is Amazon Redshift. I need to schedule the ETL jobs on an hourly basis. Can you please give me some ideas? Is Talend Open Studio the right option?
Hi Sravan,
I no longer work for Talend. Maybe you should ask in the Talend forum. As far as I remember, you have to use scripting. A user interface for managing ETL jobs is only available in the commercial version.
Kai
hello,
thank you for the article about Redshift. I also think it's a very good solution for companies that don't have a lot of money to start with business intelligence.
I personally prefer Pentaho Kettle as an ETL solution, but in the case of Redshift it's just a lot faster to upload a CSV file to an S3 bucket (using Pentaho / Talend) and then use the COPY command to load it into Redshift (via a scheduler like cron):
http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html