feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25

Draft
handecelikkanat wants to merge 1 commit into main from feat/s3-access-via-warcio1.8

@handecelikkanat
Contributor

@handecelikkanat handecelikkanat commented Apr 9, 2026

From https://github.com/commoncrawl/issues/issues/684

This PR adds direct remote access (s3, https) to WARC/WET/WAT files in S3 buckets, using warcio.

Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst

This PR adds:

  • An fsspec_open call from warcio.utils, replacing the local open call in warcio-iterator.py
  • A new make target for direct remote access over s3: make iterate-remote-s3
  • [If Greg confirms we can use the public bucket] A new make target for direct remote access over https: make iterate-remote-https
  • A new README.md section on running these targets: Task 2-i: Iterating over "Remote" Files
  • [ ] A new section on using warcio index locally vs. remotely
    • Decided not to add this. Comments?
  • warcio extract might now be able to work with remote files
  • Open question: does cdxj-indexer work with remote files?
  • A new requirement: warcio[s3]>=1.8.0

Note:
I am keeping the existing task/target that iterates over local files from the GitHub repo (make iterate).
I think it is a gentle start.
It may also encourage people to inspect the files more closely on their own machines.
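
For reference, the two remote targets would point at the same object through different URL schemes. The following is a minimal sketch of that mapping, assuming the public bucket in question is Common Crawl's `commoncrawl` bucket served over HTTPS at `data.commoncrawl.org`. The helper name `s3_to_https` and the example key are hypothetical and not part of this PR.

```python
from urllib.parse import urlsplit

# Assumption: the public bucket is "commoncrawl", mirrored at this HTTPS prefix.
PUBLIC_BUCKET = "commoncrawl"
HTTPS_PREFIX = "https://data.commoncrawl.org/"


def s3_to_https(s3_url: str) -> str:
    """Translate s3://commoncrawl/<key> into the matching HTTPS URL."""
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or parts.netloc != PUBLIC_BUCKET:
        raise ValueError(f"not an s3://{PUBLIC_BUCKET}/ URL: {s3_url!r}")
    # parts.path starts with "/", which the prefix already provides.
    return HTTPS_PREFIX + parts.path.lstrip("/")
```

With this, a single key could feed both make targets, e.g. `s3_to_https("s3://commoncrawl/<key>")` for the HTTPS variant.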

@malteos

malteos commented Apr 10, 2026

> fsspec_open call from warcio.utils

This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> fsspec_open call from warcio.utils
>
> This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

Previously this used a local file open; I'll check fsspec.

> To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

I now think warcio extract should work with remote files as well. I'll modify that task: cdx index extract info -> warcio extract over (local and) remote files.

Any other suggestions? To me, warcio index is easily confused with cdx index because of the shared "index" label.

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

@malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> @malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

I guess this is not guaranteed. Their setup.py depends on warcio but without the s3 extra, and does not require >= 1.8.0: https://github.com/webrecorder/cdxj-indexer/blob/9ad2b9e1c54d2d20c391050fdb831ca1ee981504/setup.py#L49

I'll continue on the assumption that it needs to work on local files.

EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️
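
Besides the stricter pin (warcio[s3]>=1.8.0), a script could guard against an older warcio at runtime. This is a rough sketch using only the standard library. The function names are hypothetical, and the simple parser ignores pre-release suffixes, so `1.8.0rc1` would be treated as `1.8.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum warcio release with remote (s3/https) access, per this PR.
MIN_WARCIO = (1, 8, 0)


def release_tuple(version_string):
    """Parse the leading numeric release, e.g. '1.8' -> (1, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unparseable version: {version_string!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad to three components so (1, 8) compares equal to (1, 8, 0).
    return parts + (0,) * (3 - len(parts))


def supports_remote_access(version_string, minimum=MIN_WARCIO):
    """True if the given warcio version is at least `minimum`."""
    return release_tuple(version_string) >= minimum


def installed_warcio_supports_remote():
    """Check the installed warcio; False if it is missing or too old."""
    try:
        return supports_remote_access(version("warcio"))
    except PackageNotFoundError:
        return False
```

A script relying on remote files could call `installed_warcio_supports_remote()` up front and fall back to local files otherwise.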
