feat(storage): Add HDFS backend via opendal services-hdfs-native #2441
Open
jordepic wants to merge 1 commit into
Conversation
Author
@blackmwk , @kevinjqliu , @CTTY would you mind taking a look at this? My org won't really be able to pick up iceberg-rust until we can read tables from Hadoop. I made a few deliberate choices here, described below. Thank you very much!
Add an opendal-hdfs-native cargo feature plus
OpenDalStorageFactory::Hdfs and OpenDalStorage::Hdfs variants in
iceberg-storage-opendal. The new variants are backed by OpenDAL's
services-hdfs-native service, which speaks the HDFS RPC protocol
directly in pure Rust (no JNI / libhdfs).
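The feature wiring could be sketched as the following Cargo.toml fragment; the exact feature and dependency names are inferred from the description above, so treat them as assumptions rather than the literal diff:

```toml
# crates/storage/opendal/Cargo.toml (sketch; names are assumptions)
[features]
# Enables the HDFS backend by turning on OpenDAL's pure-Rust
# hdfs-native service -- no JNI or libhdfs involved.
opendal-hdfs-native = ["opendal/services-hdfs-native"]
```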
The wire-up in crates/storage/opendal/src/lib.rs and resolving.rs
follows, to the letter, the same factory + variant + scheme-routing
pattern used by every other OpenDAL-backed storage in that crate
(s3, gcs, oss, azdls).
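The factory + scheme-routing pattern can be sketched roughly as below. This is an illustrative reduction, not the crate's actual code: the variant names mirror the backends listed above, but the scheme strings each backend accepts are assumptions.

```rust
// Hypothetical sketch of the factory + scheme-routing pattern;
// variant names follow the backends named in the description,
// scheme strings are illustrative assumptions.
#[derive(Debug, PartialEq)]
enum OpenDalStorageFactory {
    S3,
    Gcs,
    Oss,
    Azdls,
    Hdfs, // the new variant added by this PR
}

/// Route a URI scheme to the factory that handles it, mirroring
/// the routing done in resolving.rs.
fn factory_for_scheme(scheme: &str) -> Option<OpenDalStorageFactory> {
    match scheme {
        "s3" | "s3a" => Some(OpenDalStorageFactory::S3),
        "gs" | "gcs" => Some(OpenDalStorageFactory::Gcs),
        "oss" => Some(OpenDalStorageFactory::Oss),
        "abfs" | "abfss" => Some(OpenDalStorageFactory::Azdls),
        "hdfs" => Some(OpenDalStorageFactory::Hdfs),
        _ => None,
    }
}
```

Adding a backend under this pattern is then a matter of one new enum variant plus one new match arm, which is what keeps the HDFS wire-up small.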
Mirroring the Java HadoopFileIO, no iceberg-level HDFS configuration
is exposed: there is no HdfsConfig type and no `hdfs.*` property
constants. HDFS topology (HA name services, namenode RPC addresses)
and Kerberos authentication are entirely delegated to hdfs-native
and its environment. hdfs-native reads core-site.xml / hdfs-site.xml
from HADOOP_CONF_DIR (or HADOOP_HOME/etc/hadoop); Kerberos auth is
delegated to libgssapi_krb5 via the standard KRB5CCNAME /
KRB5_CONFIG environment variables. These deployment requirements are
documented in the opendal backend module
(crates/storage/opendal/src/hdfs.rs). The opendal-hdfs-native
feature is added to opendal-all but not to the default set; it
requires libgssapi_krb5 installed at the OS level for the runtime
dlopen (`brew install krb5` on macOS,
`apt install libgssapi-krb5-2` on Debian).
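Concretely, a deployment environment for this backend might look like the following; every path here is an example of the convention described above, not something the PR itself mandates:

```shell
# Hypothetical deployment environment for hdfs-native + Kerberos.
# All paths are examples; the PR exposes no iceberg-level knobs for these.
export HADOOP_CONF_DIR=/etc/hadoop/conf      # core-site.xml / hdfs-site.xml read from here
export KRB5_CONFIG=/etc/krb5.conf            # Kerberos realm configuration
export KRB5CCNAME=FILE:/tmp/krb5cc_iceberg   # credential cache consumed via libgssapi_krb5
```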
Path parsing uses url::Url. Paths with an authority
(`hdfs://nameservice1/foo`) build an operator targeted at that name
node; authority-less paths (`hdfs:///foo`) build an operator without
an explicit name node, so hdfs-native picks up `fs.defaultFS` from
the loaded Hadoop config - the same behavior Java HadoopFileIO gets
via Hadoop's Configuration.

OpenDalStorage::Hdfs holds a per-name-node operator cache
(Arc<RwLock<HashMap<Option<String>, Operator>>> keyed by
`Some("hdfs://<authority>")` for authority-bearing paths and `None`
for authority-less paths). The cache uses stdlib primitives
(Arc + RwLock + HashMap) rather than a third-party concurrent-map
crate; for the typical workload (one to a few nameservices,
read-heavy after warmup) the RwLock-guarded map is more than
adequate and avoids an additional workspace dependency. OpenDAL's
RetryLayer is applied uniformly to every operator at the call site;
no bespoke retry logic.
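A minimal sketch of that get-or-create cache, with a placeholder Operator type standing in for opendal::Operator (the real operator construction, url::Url parsing, and RetryLayer application are omitted; the type and method names here are illustrative, not the PR's actual identifiers):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Placeholder for `opendal::Operator`; the real code builds one via
/// OpenDAL's hdfs-native service and wraps it in `RetryLayer`.
#[derive(Clone, Debug, PartialEq)]
struct Operator(Option<String>);

struct HdfsOperatorCache {
    /// Keyed by `Some("hdfs://<authority>")` for authority-bearing
    /// paths and `None` for authority-less paths (fs.defaultFS).
    inner: Arc<RwLock<HashMap<Option<String>, Operator>>>,
}

impl HdfsOperatorCache {
    fn new() -> Self {
        Self { inner: Arc::new(RwLock::new(HashMap::new())) }
    }

    /// Fast read-locked lookup on the warm path; falls back to a
    /// write lock only on a miss.
    fn get_or_create(&self, authority: Option<&str>) -> Operator {
        let key = authority.map(|a| format!("hdfs://{a}"));
        if let Some(op) = self.inner.read().unwrap().get(&key) {
            return op.clone();
        }
        let mut map = self.inner.write().unwrap();
        // `entry` re-checks under the write lock, so two racing
        // callers still end up sharing a single cached operator.
        map.entry(key.clone())
            .or_insert_with(|| Operator(key))
            .clone()
    }
}
```

The read-then-write double check is what makes the RwLock approach adequate for the read-heavy-after-warmup workload described above: after the first access per nameservice, every call takes only the shared read lock.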
The integration-test docker-compose fixture mirrors apache/opendal's
own HDFS fixture (fixtures/hdfs/docker-compose-hdfs-cluster.yml):
the same bde2020 hadoop-namenode and hadoop-datanode images, both
running in host network mode. Host networking is required because
hdfs-native 0.13.5 connects to the DataNode by IP from
DatanodeIdProto.ip_addr; on a docker bridge the DN would register
with an unroutable bridge IP. Host networking works on Linux CI
runners but has known issues on macOS / Windows Docker Desktop, so
the integration tests are marked `#[ignore]` and CI explicitly opts
them in via `cargo nextest run --run-ignored=only -E 'test(file_io_hdfs)'`
in the existing Linux-only tests job in .github/workflows/ci.yml.
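The fixture described above could be sketched roughly as follows; the image tags and ports are assumptions modeled on apache/opendal's fixture, not a copy of the file in this PR:

```yaml
# Sketch of a host-network HDFS fixture mirroring apache/opendal's
# fixtures/hdfs/docker-compose-hdfs-cluster.yml.
# Image tags and the defaultFS port are assumptions.
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    network_mode: host   # required so the DN registers a routable IP
    environment:
      - CLUSTER_NAME=test
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    network_mode: host
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://localhost:8020
```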
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?
Yes, we have unit tests to ensure appropriate caching per name service as well as integration tests at parity with other storage backends.