@@ -31,15 +31,15 @@ extract:
 # python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 # @echo
 #
-# download_collinfo:
-# @echo "downloading collinfo.json so we can find out the crawl name"
-# curl -O https://index.commoncrawl.org/collinfo.json
-#
-# CC-MAIN-2024-22.warc.paths.gz:
-# @echo "downloading the list from s3, requires s3 auth even though it is free"
-# @echo "note that this file should be in the repo"
-# aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
-#
+download_collinfo:
+	@echo "downloading collinfo.json so we can find out the crawl name"
+	curl -O https://index.commoncrawl.org/collinfo.json
+
+CC-MAIN-2024-22.warc.paths.gz:
+	@echo "downloading the list from s3, requires s3 auth even though it is free"
+	@echo "note that this file should be in the repo"
+	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
+
 # duck_local_files:
 # @echo "warning! 300 gigabyte download"
 # python duck.py local_files