If we attempt to inspect HDFS, we will not find anything interesting there:
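```
bin/hadoop dfs -ls
```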
The "-ls" command returns silently. Without any arguments, -ls will attempt to show the contents of your "home" directory inside HDFS. Don't forget, this is not the same as /home/$USER (e.g.,/home/someone) on the host machine (HDFS keeps a separate namespace from the local files). There is no concept of a "current working directory" or cd command in HDFS.
If you provide -ls with an argument, you may see some initial directory contents:
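```
bin/hadoop dfs -ls /
Found 1 items
/tmp    <dir>   2008-06-12 20:15   rwxr-xr-x   hadoop   supergroup
```
(The listing above is only illustrative; the exact directories, dates, and times will vary from one installation to the next.)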
These entries are created by the system. This example output assumes that "hadoop" is the username under which the Hadoop daemons (NameNode, DataNode, etc.) were started. "supergroup" is a special group whose membership includes the username under which the HDFS instances were started (e.g., "hadoop"). These directories exist to allow the Hadoop MapReduce system to move necessary data to the different job nodes; this is explained in more detail in Module 4.
So we need to create our home directory, and then populate it with some files.
class="sectionSubH">Inserting data into the cluster
Whereas a typical UNIX or Linux system stores individual users' files in /home/$USER, the Hadoop DFS stores these in /user/$USER. For some commands, such as ls, a required directory name that is left blank defaults to this directory. (Other commands require explicit source and destination paths.) Any relative paths used as arguments to HDFS, Hadoop MapReduce, or other components of the system are interpreted relative to this base directory.
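For example (using a hypothetical directory named myfiles), these two commands would list the same HDFS directory:

```
bin/hadoop dfs -ls myfiles
bin/hadoop dfs -ls /user/yourUserName/myfiles
```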
Step 1: Create your home directory if it does not already exist.
If there is no /user directory, create that first. It will be automatically created later if necessary, but for instructive purposes, it makes sense to create it manually ourselves this time.
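In HDFS this is done with the -mkdir command:

```
bin/hadoop dfs -mkdir /user
```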
Then we are free to add our own home directory:
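```
bin/hadoop dfs -mkdir /user/someone
```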
Of course, replace /user/someone with /user/yourUserName.
Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:
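```
bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/
```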
This copies /home/someone/interestingFile.txt from the local file system into /user/yourUserName/interestingFile.txt on HDFS.
Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:
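```
# explicit path to your home directory
bin/hadoop dfs -ls /user/yourUserName

# no argument; -ls defaults to your home directory
bin/hadoop dfs -ls
```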
You should see a listing that starts with Found 1 items and then includes information about the file you inserted.
The following table demonstrates example uses of the put command, and their effects:
| Command | Assuming | Outcome |
|---|---|---|
| bin/hadoop dfs -put foo bar | No file/directory named /user/$USER/bar exists in HDFS | Uploads local file foo to a file named /user/$USER/bar |
| bin/hadoop dfs -put foo bar | /user/$USER/bar is a directory | Uploads local file foo to a file named /user/$USER/bar/foo |
| bin/hadoop dfs -put foo somedir/somefile | /user/$USER/somedir does not exist in HDFS | Uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory |
| bin/hadoop dfs -put foo bar | /user/$USER/bar is already a file in HDFS | No change in HDFS, and an error is returned to the user |
When the put command operates on a file, it is all-or-nothing. Uploading a file into HDFS first copies the data onto the DataNodes. Only when they all acknowledge that they have received all the data and the file handle is closed is the file made visible to the rest of the system. Thus, based on the return value of the put command, you can be confident that a file has either been successfully uploaded or has "fully failed"; you will never get into a state where the upload was interrupted partway through yet the partial contents remain visible to other clients. If an upload is disconnected before it completes, it is as though no upload took place.
Step 4: Uploading multiple files at once. The put command is more powerful than moving a single file at a time. It can also be used to upload entire directory trees into HDFS.
Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:
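```
ls -R myfiles
myfiles:
file1.txt  file2.txt  subfiles

myfiles/subfiles:
anotherFile.txt
```
(The particular file names here are only an example; any small local directory tree will do.)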
This entire myfiles/ directory can be copied into HDFS like so:
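```
bin/hadoop dfs -put myfiles /user/someone
```

As before, substitute your own username for someone. Listing the uploaded directory should then show something like the following (the sizes, dates, and times here are illustrative):

```
bin/hadoop dfs -ls myfiles
Found 3 items
/user/someone/myfiles/file1.txt    <r 1>   186731   2008-06-12 20:59   rw-r--r--   someone   supergroup
/user/someone/myfiles/file2.txt    <r 1>   168      2008-06-12 20:59   rw-r--r--   someone   supergroup
/user/someone/myfiles/subfiles     <dir>            2008-06-12 20:59   rwxr-xr-x   someone   supergroup

bin/hadoop dfs -ls myfiles/subfiles
Found 1 items
/user/someone/myfiles/subfiles/anotherFile.txt   <r 1>   3028   2008-06-12 20:59   rw-r--r--   someone   supergroup
```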
The listing demonstrates that the tree was correctly uploaded recursively. You'll note that in addition to the file path, ls also reports the number of replicas of each file (the "1" in <r 1>), the file size, upload time, permissions, and owner information.
Another synonym for -put is -copyFromLocal. The syntax and functionality are identical.
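For example, the upload from Step 2 could equally have been written as:

```
bin/hadoop dfs -copyFromLocal /home/someone/interestingFile.txt /user/yourUserName/
```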