WebHDFS python client library and simple shell.

Requirements:
- Python 3.4+
- Python requests module
Install python-webhdfs as a Debian package by building a deb:
dpkg-buildpackage
# or
pdebuild
Install python-webhdfs using the standard setuptools script:
python setup.py install
To use the WebHDFS Client API, start by importing the class from the module:

>>> from webhdfs import WebHDFSClient

All functions may raise a WebHDFSError exception or one of these subclasses:
| Exception Type | Remote Exception | Description |
|---|---|---|
| WebHDFSConnectionError | | Unable to connect to active NameNode |
| WebHDFSIncompleteTransferError | | Transferred file size does not match the origin |
| WebHDFSAccessControlError | AccessControlException | Access to specified path denied |
| WebHDFSIllegalArgumentError | IllegalArgumentException | Invalid parameter value |
| WebHDFSFileNotFoundError | FileNotFoundException | Specified path does not exist |
| WebHDFSSecurityError | SecurityException | Failed to obtain user/group information |
| WebHDFSUnsupportedOperationError | UnsupportedOperationException | Requested operation is not implemented |
| WebHDFSUnknownRemoteError | | Remote exception not recognized |
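All of these derive from WebHDFSError, so callers can trap everything with the base class or handle specific failures individually. A minimal sketch, assuming the exception classes are importable from the webhdfs module alongside the client (only the class names above are documented) and an hdfs client constructed as in the examples below:

>>> from webhdfs import WebHDFSError, WebHDFSFileNotFoundError
>>> try:
...     hdfs.stat('/no/such/path')
... except WebHDFSFileNotFoundError:
...     print('path does not exist')
... except WebHDFSError as e:
...     print('other WebHDFS failure: %s' % e)
path does not exist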
Creates a new WebHDFSClient object.
Parameters:
- base: base WebHDFS url (e.g. http://localhost:50070)
- user: user name with which to access all resources
- conf: (optional) path to hadoop configuration directory for NameNode HA resolution
- wait: (optional) floating point number in seconds for request timeout waits
>>> import getpass
>>> hdfs = WebHDFSClient('http://localhost:50070', getpass.getuser(), conf='/etc/hadoop/conf', wait=1.5)

Retrieves metadata about the specified HDFS item. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=GETFILESTATUS
Parameters:
- path: HDFS path to fetch
- catch: (optional) trap WebHDFSFileNotFoundError instead of raising the exception
Returns:
- A single WebHDFSObject object for the specified path
- False if object not found in HDFS and catch=True
>>> o = hdfs.stat('/user')
>>> print(o.full)
/user
>>> print(o.kind)
DIRECTORY
>>> o = hdfs.stat('/foo', catch=True)
>>> print(o)
False

Lists a specified HDFS path. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS
Parameters:
- path: HDFS path to list
- recurse: (optional) descend down the directory tree
- request: (optional) filter request callback for each returned object
Returns:
- Generator producing child WebHDFSObject objects for the specified path
>>> l = list(hdfs.ls('/'))  # must convert to list if referencing by index
>>> print(l[0].full)
/user
>>> print(l[0].kind)
DIRECTORY
>>> l = list(hdfs.ls('/user', request=lambda x: x.name.startswith('m')))
>>> print(l[0].full)
/user/max

Lists a specified HDFS path pattern. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS
Parameters:
- path: HDFS path pattern to list
Returns:
- List of WebHDFSObject objects for the specified pattern
>>> l = hdfs.glob('/us*')
>>> print(l[0].full)
/user
>>> print(l[0].kind)
DIRECTORY

Gets the usage of a specified HDFS path. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY
Parameters:
- path: HDFS path to analyze
- real: (optional) specifies return type
Returns:
- If real is None: instance of a du object: du(dirs=, files=, hdfs_usage=, disk_usage=, hdfs_quota=, disk_quota=)
- If real is a string: integer value of the named du object attribute
- If real is boolean True: integer of disk bytes used by the specified path (actual usage across all replicas)
- If real is boolean False: integer of hdfs bytes used by the specified path
>>> u = hdfs.du('/user')
>>> print(u)
110433
>>> u = hdfs.du('/user', real=True)
>>> print(u)
331299
>>> u = hdfs.du('/user', real='disk_quota')
>>> print(u)
-1
>>> u = hdfs.du('/user', real=None)
>>> print(u)
du(dirs=3, files=5, hdfs_usage=110433, disk_usage=331299, hdfs_quota=-1, disk_quota=-1)

Creates the specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=MKDIRS
Parameters:
- path: HDFS path to create
Returns:
- Boolean True
>>> hdfs.mkdir('/user/%s/test' % getpass.getuser())
True

Moves/renames the specified HDFS path to the specified destination. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=RENAME&destination=<DEST>
Parameters:
- path: HDFS path to move/rename
- dest: destination path
Returns:
- Boolean True on success and False on error
>>> hdfs.mv('/user/%s/test' % getpass.getuser(), '/user/%s/test.old' % getpass.getuser())
True
>>> hdfs.mv('/user/%s/test.old' % getpass.getuser(), '/some/non-existent/path')
False

Removes the specified HDFS path. Uses this WebHDFS REST request:
DELETE <BASE>/webhdfs/v1/<PATH>?op=DELETE
Parameters:
- path: HDFS path to remove
Returns:
- Boolean True
>>> hdfs.rm('/user/%s/test' % getpass.getuser())
True

Sets the replication factor for the specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETREPLICATION
Parameters:
- path: HDFS path to change
- num: new replication factor to apply
Returns:
- Boolean True on success, False otherwise
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
1
>>> hdfs.repl('/user/%s/test' % getpass.getuser(), 3)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
3

Sets the owner and/or group of a specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<OWNER>][&group=<GROUP>]
Parameters:
- path: HDFS path to change
- owner: (optional) new object owner
- group: (optional) new object group
Returns:
- Boolean True if ownership successfully applied
Raises:
- WebHDFSIllegalArgumentError if both owner and group are unspecified or empty
>>> hdfs.chown('/user/%s/test' % getpass.getuser(), owner='other_owner', group='other_group')
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).owner
'other_owner'
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).group
'other_group'

Sets the permission of a specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETPERMISSION&permission=<PERM>
Parameters:
- path: HDFS path to change
- perm: new object permission
Returns:
- Boolean True if permission successfully applied
Raises:
- WebHDFSIllegalArgumentError if permission is not an octal integer no greater than 0o777
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rwxr-xr-x'
>>> hdfs.chmod('/user/%s/test' % getpass.getuser(), perm=0o644)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rw-r--r--'

Sets the modification time of a specified HDFS path, optionally creating it. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETTIMES&modificationtime=<TIME>
Parameters:
- path: HDFS path to change
- time: (optional) object modification time, represented as a Python datetime object or int epoch timestamp, defaulting to the current time
Returns:
- Boolean True if modification time successfully changed
Raises:
- WebHDFSIllegalArgumentError if time is not a valid type
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser())
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2019, 1, 28, 12, 10, 20)
>>> import datetime
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser(), datetime.datetime(2018, 9, 27, 11, 1, 17))
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2018, 9, 27, 11, 1, 17)

Fetches the specified HDFS path, returning its contents as bytes or writing them to a file, based on parameters. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=OPEN
Parameters:
- path: HDFS path to fetch
- data: (optional) file-like object open for write in binary mode
Returns:
- Boolean True if data is set and the written file size matches the source
- Bytes contents of the fetched file if data is None
Raises:
- WebHDFSIncompleteTransferError
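For illustration, a minimal sketch of both modes, inferring the get(path, data=None) signature from the parameters above (the remote path and its size are hypothetical):

>>> blob = hdfs.get('/user/max/snmpy.mib')
>>> len(blob)
20552
>>> with open('snmpy.mib', 'wb') as tmp:
...     hdfs.get('/user/max/snmpy.mib', data=tmp)
True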
Creates the specified HDFS file from the contents of a file-like object open for read, or from the value of the given bytes or string. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=CREATE
Parameters:
- path: HDFS path to create
- data: file-like object open for read in binary mode, bytes, or string
Returns:
- Boolean True if the written file size matches the source
Raises:
- WebHDFSIncompleteTransferError
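Similarly, a minimal sketch of put() with data given as bytes and as an open file (both target paths are hypothetical):

>>> hdfs.put('/user/max/hello.txt', data=b'hello world\n')
True
>>> with open('snmpy.mib', 'rb') as src:
...     hdfs.put('/user/max/snmpy.mib', data=src)
True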
Read-only property that retrieves the number of HTTP requests performed so far.
>>> l = list(hdfs.ls('/user', recurse=True))
>>> hdfs.calls
11

Creates a new WebHDFSObject object. Instances are normally obtained from WebHDFSClient calls such as stat() and ls() rather than constructed directly:
>>> o = hdfs.stat('/')
>>> type(o)
<class 'webhdfs.attrib.WebHDFSObject'>

Determines whether the HDFS object is a directory or not.
Parameters: None
Returns:
- Boolean True when the object is a directory, False otherwise
>>> o = hdfs.stat('/')
>>> o.is_dir()
True

Determines whether the HDFS object is empty or not.
Parameters: None
Returns:
- Boolean True when the object is a directory with no children or a file of 0 size, False otherwise
>>> o = hdfs.stat('/')
>>> o.is_empty()
False

Read-only property that retrieves the HDFS object owner.
>>> o = hdfs.stat('/')
>>> o.owner
'hdfs'

Read-only property that retrieves the HDFS object group.
>>> o = hdfs.stat('/')
>>> o.group
'supergroup'

Read-only property that retrieves the HDFS object base file name.
>>> o = hdfs.stat('/user/max')
>>> o.name
'max'

Read-only property that retrieves the HDFS object full file name.
>>> o = hdfs.stat('/user/max')
>>> o.full
'/user/max'

Read-only property that retrieves the HDFS object size in bytes.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.size
20552

Read-only property that retrieves the HDFS object replication factor.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.repl
1

Read-only property that retrieves the HDFS object type (FILE or DIRECTORY).
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.kind
'FILE'

Read-only property that retrieves the HDFS object last modification timestamp as a Python datetime object.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.date
datetime.datetime(2015, 3, 7, 3, 53, 6)

Read-only property that retrieves the HDFS object symbolic permissions mode.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.mode
'-rw-r--r--'

Read-only property that retrieves the HDFS object octal permissions mode, usable by Python's stat module.
>>> import stat
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> oct(o.perm)
'0o100644'
>>> stat.S_ISDIR(o.perm)
False
>>> stat.S_ISREG(o.perm)
True

usage: webhdfs [-h] [-d CWD] [-l LOG] [-c CFG] [-t TIMEOUT] [-v]
url [cmd [cmd ...]]
webhdfs shell
positional arguments:
url webhdfs base url
cmd run this command and exit
optional arguments:
-h, --help show this help message and exit
-d CWD, --cwd CWD initial hdfs directory
-l LOG, --log LOG logger destination url
-c CFG, --cfg CFG hdfs configuration dir
-t TIMEOUT, --timeout TIMEOUT
request timeout in seconds
-v, --version print version and exit
supported logger formats:
console://?level=LEVEL
file://PATH?level=LEVEL
syslog+tcp://HOST:PORT/?facility=FACILITY&level=LEVEL
syslog+udp://HOST:PORT/?facility=FACILITY&level=LEVEL
syslog+unix://PATH?facility=FACILITY&level=LEVEL
Parameters:
- url: base url for the WebHDFS endpoint, supporting http, https, and hdfs schemes
- cmd: (optional) run the specified command with args and exit without starting the shell
- -d | --cwd: (optional) initial hdfs directory to switch to on shell invocation
- -l | --log: (optional) logger destination url as described by supported formats
- -c | --cfg: (optional) hadoop configuration directory for NameNode HA resolution
- -t | --timeout: (optional) request timeout in seconds as floating point number
- -v | --version: (optional) print shell/library version and exit
Environment Variables:
- HADOOP_CONF_DIR: alternative to and takes precedence over the -c | --cfg command-line parameter
- WEBHDFS_HISTFILE: (optional) specify the preserved history file, defaulting to ~/.webhdfs_history
- WEBHDFS_HISTSIZE: (optional) specify the preserved history size, defaulting to 1000; set to 0 to disable
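For example, a few hypothetical invocations based on the usage above (the ls command name is an assumption about the shell's built-ins, and log level names are assumed to follow Python's logging module):

webhdfs -c /etc/hadoop/conf -t 1.5 http://localhost:50070
webhdfs -l 'file:///tmp/webhdfs.log?level=DEBUG' http://localhost:50070
webhdfs http://localhost:50070 ls /user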