feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25
handecelikkanat wants to merge 1 commit into main
Conversation
This seems unnecessary. You could open the remote files directly via fsspec; there is no need to use the warcio utils. To illustrate warcio's S3 support, you could call the warcio CLI directly, without the custom Python script.
Previously this used a local file open; I'll check fsspec.
I was now thinking that Any other suggestions? |
@malteos Can |
EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️
From https://github.com/commoncrawl/issues/issues/684
This PR adds direct remote access (s3, https) to WARC/WET/WAT files in S3 buckets, using warcio.
Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst
This PR adds:
- `fsspec_open` call from `warcio.utils` to replace the local open call in `warcio-iterator.py`
- `make iterate-remote-s3`
- `make iterate-remote-https`
- Task 2-i: Iterating over "Remote" Files
- [ ] New section to use `warcio index` locally vs remotely
- `warcio extract` might be able to work with remote files now
- `cdxj-indexer` work with remote files?
- `warcio[s3]>=1.8.0`

Note:
I still keep the task/target to iterate over local files from the GitHub repo (`make iterate`). I think this is a gentle start.
It might invite people to inspect the files more closely on their local machine.
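For reference, here is a minimal sketch of the alternative raised in review: opening the remote file directly with fsspec and passing the stream to warcio's `ArchiveIterator`. This is not the code in this PR. The helper names `s3_to_https` and `iter_remote_records` are hypothetical, and the pairing of the `commoncrawl` bucket with `https://data.commoncrawl.org` is an assumption that does not come from this thread. Reading `s3://` URLs through fsspec needs `s3fs` installed, and `https://` URLs need `aiohttp`.

```python
from typing import Iterator, Optional, Tuple


def s3_to_https(url: str,
                bucket: str = "commoncrawl",
                https_base: str = "https://data.commoncrawl.org") -> str:
    """Map s3://<bucket>/<key> to <https_base>/<key>.

    The default bucket/host pairing is an assumption for illustration.
    """
    prefix = f"s3://{bucket}/"
    if not url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}, got {url!r}")
    return f"{https_base}/{url[len(prefix):]}"


def iter_remote_records(url: str,
                        **storage_options) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (record type, WARC-Target-URI) for each record in a remote file.

    `storage_options` is forwarded to the fsspec filesystem (for example,
    credentials for S3).
    """
    # Imported lazily so that s3_to_https works without the optional
    # remote-access dependencies installed.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # fsspec picks the filesystem from the URL scheme (s3://, https://, or a
    # local path). ArchiveIterator handles the gzip layer itself, so the
    # stream is opened as raw bytes without fsspec-side decompression.
    with fsspec.open(url, "rb", **storage_options) as stream:
        for record in ArchiveIterator(stream):
            yield record.rec_type, record.rec_headers.get_header("WARC-Target-URI")


# Example with a placeholder path (needs network access):
# for rec_type, uri in iter_remote_records(
#         s3_to_https("s3://commoncrawl/<path-to-file>.warc.gz")):
#     print(rec_type, uri)
```

Because fsspec dispatches on the URL scheme, the same function would cover the local, S3, and HTTPS cases, which is roughly what the three `make` targets exercise.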