
feat(storage): Add HDFS backend via opendal services-hdfs-native #2441

Open

jordepic wants to merge 1 commit into apache:main from jordepic:main

Conversation


@jordepic jordepic commented May 12, 2026

Which issue does this PR close?

What changes are included in this PR?

Add an opendal-hdfs-native cargo feature plus
OpenDalStorageFactory::Hdfs and OpenDalStorage::Hdfs variants in
iceberg-storage-opendal. The variant uses OpenDAL's
services-hdfs-native, which talks the HDFS RPC protocol directly in
pure Rust (no JNI / libhdfs).
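
As a rough sketch, the new variants slot in next to the existing ones; the variant names come from this description, but the field layout (notably the operator cache on `OpenDalStorage::Hdfs`, assumed from the cache discussion below) is illustrative, not the actual diff:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

use opendal::Operator;

pub enum OpenDalStorageFactory {
    // ... existing variants (S3, Gcs, Oss, Azdls) ...
    #[cfg(feature = "opendal-hdfs-native")]
    Hdfs,
}

pub enum OpenDalStorage {
    // ... existing variants ...
    #[cfg(feature = "opendal-hdfs-native")]
    Hdfs {
        /// Per-name-node operator cache, keyed by `Some("hdfs://<authority>")`
        /// for authority-bearing paths and `None` for authority-less paths.
        operators: Arc<RwLock<HashMap<Option<String>, Operator>>>,
    },
}
```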

The wire-up in crates/storage/opendal/src/lib.rs and resolving.rs
follows the same factory + variant + scheme-routing pattern used by
every other OpenDAL-backed storage in that crate (s3, gcs, oss,
azdls), to the letter.
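
The scheme routing amounts to one more match arm. Here is a minimal self-contained sketch; the helper name `resolve_scheme`, the scheme strings for the other backends, and the string labels (standing in for the real factory type) are all illustrative:

```rust
// Minimal sketch of scheme-based routing; only the "hdfs" arm is what this
// PR adds. Returns a label instead of the real factory to stay self-contained.
fn resolve_scheme(location: &str) -> Result<&'static str, String> {
    let scheme = location
        .split_once("://")
        .map(|(scheme, _)| scheme)
        .ok_or_else(|| format!("no scheme in `{location}`"))?;
    match scheme {
        "s3" | "s3a" => Ok("s3"),
        "gs" => Ok("gcs"),
        "oss" => Ok("oss"),
        "abfss" | "abfs" => Ok("azdls"),
        "hdfs" => Ok("hdfs"), // new in this PR
        other => Err(format!("unsupported scheme `{other}`")),
    }
}
```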

Mirroring the Java HadoopFileIO, no iceberg-level HDFS configuration
is exposed: there is no HdfsConfig type and no `hdfs.*` property
constants. HDFS topology (HA name services, namenode RPC addresses)
and Kerberos authentication are entirely delegated to hdfs-native
and its environment. hdfs-native reads core-site.xml / hdfs-site.xml
from HADOOP_CONF_DIR (or HADOOP_HOME/etc/hadoop); Kerberos auth is
delegated to libgssapi_krb5 via the standard KRB5CCNAME /
KRB5_CONFIG env. These deployment requirements are documented in the
opendal backend module (crates/storage/opendal/src/hdfs.rs). The
opendal-hdfs-native feature is added to opendal-all but not to the
default set; it requires libgssapi_krb5 installed at the OS level
for the runtime dlopen (brew install krb5 on macOS,
apt install libgssapi-krb5-2 on Debian).
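
The module documentation reads roughly like this (a sketch assembled from the requirements above, not the file's literal contents):

```rust
//! HDFS storage backend via OpenDAL's `services-hdfs-native` (pure-Rust HDFS
//! RPC, no JNI / libhdfs).
//!
//! Deployment requirements:
//! - Hadoop config: `core-site.xml` / `hdfs-site.xml` are read from
//!   `HADOOP_CONF_DIR` (or `HADOOP_HOME/etc/hadoop`).
//! - Kerberos: delegated to `libgssapi_krb5` via the standard `KRB5CCNAME` /
//!   `KRB5_CONFIG` environment variables; the library must be installed at
//!   the OS level (`brew install krb5` on macOS, `apt install
//!   libgssapi-krb5-2` on Debian) since it is loaded via `dlopen` at runtime.
```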

Path parsing uses url::Url. Paths with an authority
(`hdfs://nameservice1/foo`) build an operator targeted at that name
node; authority-less paths (`hdfs:///foo`) build an operator without
an explicit name node, so hdfs-native picks up `fs.defaultFS` from
the loaded Hadoop config - the same behavior Java HadoopFileIO gets
via Hadoop's Configuration. OpenDalStorage::Hdfs holds a
per-name-node operator cache
(Arc<RwLock<HashMap<Option<String>, Operator>>> keyed by
`Some("hdfs://<authority>")` for authority-bearing paths and `None`
for authority-less paths). The cache uses stdlib primitives
(Arc + RwLock + HashMap) rather than a third-party concurrent map
crate; for the typical workload (one to a few nameservices,
read-heavy after warmup) the RwLock-guarded map is more than
adequate and avoids an additional workspace dependency. OpenDAL's
RetryLayer is applied uniformly to every operator at the call site;
no bespoke retry logic.
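
Put together, the parse-then-cache flow looks roughly like the following. This is a sketch assuming a recent opendal with owned-style builders; the struct and method names (`HdfsOperators`, `operator_for`) are illustrative:

```rust
use std::collections::HashMap;
use std::error::Error;
use std::sync::{Arc, RwLock};

use opendal::layers::RetryLayer;
use opendal::services::HdfsNative;
use opendal::Operator;
use url::Url;

/// Sketch of the per-name-node operator cache described above.
#[derive(Default)]
struct HdfsOperators {
    cache: Arc<RwLock<HashMap<Option<String>, Operator>>>,
}

impl HdfsOperators {
    fn operator_for(&self, location: &str) -> Result<Operator, Box<dyn Error>> {
        let url = Url::parse(location)?;
        // `hdfs://nameservice1/foo` -> Some("hdfs://nameservice1");
        // `hdfs:///foo` has no authority -> None, so hdfs-native falls back
        // to fs.defaultFS from the loaded Hadoop config.
        let key = url
            .host_str()
            .filter(|host| !host.is_empty())
            .map(|host| match url.port() {
                Some(port) => format!("hdfs://{host}:{port}"),
                None => format!("hdfs://{host}"),
            });

        if let Some(op) = self.cache.read().unwrap().get(&key) {
            return Ok(op.clone());
        }

        let mut builder = HdfsNative::default().root("/");
        if let Some(name_node) = &key {
            builder = builder.name_node(name_node);
        }
        // RetryLayer applied uniformly; no bespoke retry logic.
        let op = Operator::new(builder)?.layer(RetryLayer::new()).finish();

        // Two racing callers may both build; the second insert wins, which is
        // acceptable for a cache of cheap-to-clone operators.
        self.cache.write().unwrap().insert(key, op.clone());
        Ok(op)
    }
}
```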

The integration-test docker-compose fixture mirrors apache/opendal's
own HDFS fixture (fixtures/hdfs/docker-compose-hdfs-cluster.yml):
the same bde2020 hadoop-namenode and hadoop-datanode images, both
running in host network mode. Host networking is required because
hdfs-native 0.13.5 connects to the DataNode by IP from
DatanodeIdProto.ip_addr; on a docker bridge the DN would register
with an unroutable bridge IP. Host networking works on Linux CI
runners but has known issues on macOS / Windows Docker Desktop, so
the integration tests are marked `#[ignore]` and CI explicitly opts
them in via `cargo nextest --run-ignored=only -E 'test(file_io_hdfs)'`
in the existing Linux-only tests job in .github/workflows/ci.yml.
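
The opt-in shape is roughly the following; the test body and exact test name are illustrative (only the `#[ignore]` marker and the `file_io_hdfs` name filter come from the description above):

```rust
// Sketch of an ignored HDFS integration test; CI opts it back in with
// `cargo nextest --run-ignored=only -E 'test(file_io_hdfs)'`.
#[tokio::test]
#[ignore = "needs the host-network HDFS docker-compose fixture (Linux only)"]
async fn test_file_io_hdfs_read_write() {
    // ... build a FileIO against the fixture's hdfs:// endpoint and exercise
    // read/write/delete, mirroring the other backends' integration tests ...
}
```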

Are these changes tested?

Yes. Unit tests verify that operators are cached per name service, and the integration tests are at parity with the other storage backends.
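
A caching unit test in that spirit might look like this, reusing the hypothetical `HdfsOperators` sketch from above (whether building an operator for an unresolvable nameservice succeeds offline depends on hdfs-native's config handling, so treat this strictly as a sketch):

```rust
#[test]
fn operators_are_cached_per_name_service() {
    let storage = HdfsOperators::default();
    let _ = storage.operator_for("hdfs://nameservice1/warehouse/a").unwrap();
    let _ = storage.operator_for("hdfs://nameservice1/warehouse/b").unwrap();
    let _ = storage.operator_for("hdfs://nameservice2/warehouse/a").unwrap();
    // Two distinct authorities -> exactly two cached operators.
    assert_eq!(storage.cache.read().unwrap().len(), 2);
}
```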


jordepic commented May 12, 2026

@blackmwk, @kevinjqliu, @CTTY would you mind taking a look at this one? My org won't really be able to pick up iceberg-rust until we can read tables from Hadoop.

Made a few deliberate choices here:

  1. Went as Rust-native as possible, at the cost of ignoring tests by default: unfortunately, with HDFS's NN/DN design you need host networking with the Docker images (or some hacky code) to have the NN return a routable DN address. I copied OpenDAL's tests, which I think sets a good precedent, rather than trying to build an in-process mini cluster.
  2. Kept the cache for HDFS operators very simple and avoided external dependencies (a cache is needed because the operators maintain RPC connections). I removed HDFS configuration entirely because hdfs-native reads the conf files on the host itself. I tried to follow all the precedent set by the GCS, S3, etc. storage implementations.
  3. All the code should be in a passing state by the time you take a look. I'm really hoping to streamline the review here and to be very intentional about what I added, since I know you all have large review backlogs.

Thank you very much!


Development

Successfully merging this pull request may close these issues.

Support tables which live on HDFS
