Thursday, 4 October 2012

Listing files - Interacting With Hadoop Distributed File System (HDFS) - Hadoop Online Training


If we attempt to inspect HDFS, we will not find anything interesting there:
  someone@anynode:hadoop$ bin/hadoop dfs -ls
  someone@anynode:hadoop$
The -ls command returns silently. Without any arguments, -ls attempts to show the contents of your "home" directory inside HDFS. Note that this is not the same as /home/$USER (e.g., /home/someone) on the host machine, because HDFS keeps a namespace separate from the local file system. HDFS has no concept of a "current working directory" and no cd command.
If you provide -ls with an argument, you may see some initial directory contents:
  someone@anynode:hadoop$ bin/hadoop dfs -ls /
  Found 2 items
  drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
  drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp
These entries are created by the system. This example output assumes that "hadoop" is the username under which the Hadoop daemons (NameNode, DataNode, etc.) were started. "supergroup" is a special group whose membership includes the username under which the HDFS instances were started (e.g., "hadoop"). These directories exist so that the Hadoop MapReduce system can move necessary data to the different job nodes; this is explained in more detail in Module 4.
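As a rough illustration of how the columns in the listing above can be read, the following sketch pulls the entry type, owner, and path out of each line. The function name summarize_listing is hypothetical, and the sketch assumes the eight-column layout shown above and paths without spaces.

```shell
# summarize_listing is a hypothetical helper, not part of Hadoop.
# It reads "hadoop dfs -ls" output on stdin and assumes the layout above:
# permissions, replication, owner, group, size, date, time, path.
summarize_listing() {
  # Lines with fewer than 8 fields (such as "Found 2 items") are skipped.
  awk 'NF >= 8 { kind = ($1 ~ /^d/) ? "dir" : "file"; print kind, $3, $8 }'
}
```

It could be used as: bin/hadoop dfs -ls / | summarize_listing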
So we need to create our home directory, and then populate it with some files.
Inserting data into the cluster
Whereas a typical UNIX or Linux system stores individual users' files in /home/$USER, the Hadoop DFS stores these in /user/$USER. For some commands like ls, if a directory name is required and is left blank, this is the default directory name assumed. (Other commands require explicit source and destination paths.) Any relative paths used as arguments to HDFS, Hadoop MapReduce, or other components of the system are assumed to be relative to this base directory.
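The rule for relative paths can be sketched as a small shell function. This is only a model of the behavior described above; resolve_hdfs_path is a hypothetical helper and does not contact HDFS.

```shell
# resolve_hdfs_path is a hypothetical helper, not a Hadoop command.
# $1 = path as typed on the command line, $2 = user name (defaults to $USER)
resolve_hdfs_path() {
  path=$1
  user=${2:-$USER}
  case $path in
    /*) printf '%s\n' "$path" ;;                    # absolute paths are used as given
    "") printf '/user/%s\n' "$user" ;;              # blank means the HDFS home directory
    *)  printf '/user/%s/%s\n' "$user" "$path" ;;   # relative to /user/$USER
  esac
}
```

For example, resolve_hdfs_path myfiles someone prints /user/someone/myfiles, while resolve_hdfs_path /tmp someone prints /tmp unchanged.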
Step 1: Create your home directory if it does not already exist.
  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
If there is no /user directory, create that first. It will be automatically created later if necessary, but for instructive purposes, it makes sense to create it manually ourselves this time.
Then we are free to add our own home directory:
  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone
Of course, replace /user/someone with /user/yourUserName.
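Step 1 could be wrapped in a small script along these lines. This is a sketch under assumptions: the HADOOP variable and both function names are hypothetical, and it assumes that "dfs -test -e" exits with status 0 when the path exists, which is worth checking against your Hadoop version.

```shell
# Path to the hadoop launcher; HADOOP is a hypothetical variable name.
HADOOP=${HADOOP:-bin/hadoop}

# Create an HDFS directory only if it is not already there.
ensure_hdfs_dir() {
  # Assumption: "-test -e" exits 0 when the path exists.
  if "$HADOOP" dfs -test -e "$1"; then
    echo "exists: $1"
  else
    "$HADOOP" dfs -mkdir "$1" && echo "created: $1"
  fi
}

# Create /user first, then the home directory for the given user name.
ensure_home() {
  ensure_hdfs_dir /user && ensure_hdfs_dir "/user/$1"
}
```

Calling ensure_home someone would then create /user/someone, and /user before it if needed.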
Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:
  someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/
This copies /home/someone/interestingFile.txt from the local file system into /user/yourUserName/interestingFile.txt on HDFS.
Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:
  someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
  someone@anynode:hadoop$ bin/hadoop dfs -ls
You should see a listing that starts with Found 1 items and then includes information about the file you inserted.
The following table demonstrates example uses of the put command, and their effects:
Command: | Assuming: | Outcome:
bin/hadoop dfs -put foo bar | No file/directory named /user/$USER/bar exists in HDFS | Uploads local file foo to a file named /user/$USER/bar
bin/hadoop dfs -put foo bar | /user/$USER/bar is a directory | Uploads local file foo to a file named /user/$USER/bar/foo
bin/hadoop dfs -put foo somedir/somefile | /user/$USER/somedir does not exist in HDFS | Uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory
bin/hadoop dfs -put foo bar | /user/$USER/bar is already a file in HDFS | No change in HDFS, and an error is returned to the user.
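The outcomes in the table can be modeled with a short function. This is only a sketch of the rules as stated above; put_destination is a hypothetical helper that does not contact HDFS, and the caller supplies the state of the destination.

```shell
# put_destination is a hypothetical helper that only models the table above.
# $1 = local file, $2 = absolute HDFS destination,
# $3 = state of the destination in HDFS: missing | dir | file
put_destination() {
  src=$1
  dest=$2
  state=$3
  case $state in
    missing) printf '%s\n' "$dest" ;;                        # becomes the new file name
    dir)     printf '%s/%s\n' "$dest" "$(basename "$src")" ;; # file lands inside the directory
    file)    echo "error: $dest already exists" >&2; return 1 ;;
    *)       echo "unknown state: $state" >&2; return 2 ;;
  esac
}
```

For instance, put_destination foo /user/someone/bar dir prints /user/someone/bar/foo, matching the second row of the table.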
When the put command operates on a file, it is all-or-nothing. Uploading a file into HDFS first copies the data onto the DataNodes. Only when they have all acknowledged receiving the data and the file handle is closed does the file become visible to the rest of the system. Thus, based on the return value of the put command, you can be confident that a file has either been uploaded successfully or has "fully failed"; you will never end up in a state where an upload was interrupted partway but the partial contents are visible externally. If an upload is interrupted, it is as though no upload took place.
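Because the return value is what tells you whether the upload happened, a script might check it explicitly. The following is a minimal sketch; the HADOOP variable and the upload_file function are hypothetical names introduced for illustration.

```shell
# Path to the hadoop launcher; HADOOP is a hypothetical variable name.
HADOOP=${HADOOP:-bin/hadoop}

# Upload one file and report success or failure based on the exit status of put.
# $1 = local file, $2 = HDFS destination
upload_file() {
  if "$HADOOP" dfs -put "$1" "$2"; then
    echo "uploaded: $1 -> $2"
  else
    status=$?   # exit status of the failed put
    echo "upload failed (exit $status): $1" >&2
    return "$status"
  fi
}
```

A caller can then write upload_file interestingFile.txt /user/someone/ || exit 1 and rely on the file being either fully present or absent.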
Step 4: Upload multiple files at once. The put command can do more than upload a single file at a time. It can also be used to upload entire directory trees into HDFS.
Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:
  someone@anynode:hadoop$ ls -R myfiles
  myfiles:
  file1.txt  file2.txt  subdir/

  myfiles/subdir:
  anotherFile.txt
  someone@anynode:hadoop$
This entire myfiles/ directory can be copied into HDFS like so:
  someone@anynode:hadoop$ bin/hadoop dfs -put myfiles /user/someone
  someone@anynode:hadoop$ bin/hadoop dfs -ls
  Found 1 items
  /user/someone/myfiles   <dir>    2008-06-12 20:59    rwxr-xr-x    someone    supergroup
  someone@anynode:hadoop$ bin/hadoop dfs -ls myfiles
  Found 3 items
  /user/someone/myfiles/file1.txt   <r 1>   186731  2008-06-12 20:59        rw-r--r--       someone   supergroup
  /user/someone/myfiles/file2.txt   <r 1>   168     2008-06-12 20:59        rw-r--r--       someone   supergroup
  /user/someone/myfiles/subdir      <dir>           2008-06-12 20:59        rwxr-xr-x       someone   supergroup
This demonstrates that the tree was uploaded recursively. Note that in addition to the file path, ls also reports the number of replicas of each file that exist (the "1" in <r 1>), the file size, upload time, permissions, and owner information.
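Before uploading a tree, it can be useful to preview which HDFS paths the files would end up at. The sketch below does this purely on the local side; preview_tree_upload is a hypothetical helper, and it assumes the local directory is given as a relative name and the HDFS destination is an existing directory.

```shell
# preview_tree_upload is a hypothetical helper; it never contacts HDFS.
# $1 = local directory (relative name, e.g. myfiles)
# $2 = existing HDFS directory that will receive it (e.g. /user/someone)
preview_tree_upload() {
  local_dir=$1
  hdfs_parent=$2
  # List every local file and prefix it with the HDFS parent directory.
  find "$local_dir" -type f | LC_ALL=C sort |
    awk -v parent="$hdfs_parent" '{ print parent "/" $0 }'
}
```

Running preview_tree_upload myfiles /user/someone in the directory from the example lists the three files under /user/someone/myfiles, including the one in subdir.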
The -copyFromLocal command is a synonym for -put; the syntax and functionality are identical.
