Transfer files directly to Hadoop HDFS

I’ve been confronted with this problem a couple of times, so I figured I should share this simple tip, which came in pretty handy for a colleague who was trying to transfer a 70 GB file to HDFS from his laptop.

So let’s say you have some files on your laptop that you want to transfer to your Hadoop cluster, but you don’t have the Hadoop client tools and configuration files installed that would allow you to simply type

hdfs dfs -put /path/to/local/file hdfs://distant-machine/path-to-hdfs-file

Instead, your only entry point to the Hadoop cluster is to scp/ssh the files to an edge node (a machine that has access to the cluster) and then transfer them to HDFS with the hdfs dfs -put command.
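
For the record, that two-step version would look something like this (the /tmp/file path is just an illustration, and you need enough free disk on the edge node to hold the whole file):

scp /path/to/local/file user@edgenode:/tmp/file
ssh user@edgenode "hdfs dfs -put /tmp/file hdfs://distant-machine/path-to-hdfs-file && rm /tmp/file"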

Well, if you are confronted with this situation, try a simple pipe through ssh:

cat /path/to/local/file | ssh user@edgenode "hdfs dfs -put - hdfs://distant-machine/path-to-hdfs-file"

The quoted command will be run on the edge node, and the dash after -put means “take stdin as input”. That’s it, no need to create a temp file on the edge node and then transfer it to HDFS.
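
And since it’s just a pipe, you can slot other tools into it. If the network link from your laptop is the bottleneck, you could for instance compress on the fly; the .gz suffix here is only an example, and the file is stored compressed on HDFS:

gzip -c /path/to/local/file | ssh user@edgenode "hdfs dfs -put - hdfs://distant-machine/path-to-hdfs-file.gz"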

It also works the other way around. To copy a remote file from HDFS to your local storage, you could use the following command:

ssh user@edgenode "hdfs dfs -cat hdfs://distant-machine/path-to-hdfs-file" > /path/to/local/file
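
Since the data arrives on stdout, you don’t even have to write it to a file; you could, for example, pipe it straight into a local tool to peek at the first lines without downloading the whole thing:

ssh user@edgenode "hdfs dfs -cat hdfs://distant-machine/path-to-hdfs-file" | head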

Isn’t that simple?