Commit b3c7252

feat: task 7
1 parent 3ed8d61 commit b3c7252

1 file changed

Lines changed: 9 additions & 9 deletions

Makefile
@@ -31,15 +31,15 @@ extract:
 # python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 # @echo
 #
-# download_collinfo:
-# @echo "downloading collinfo.json so we can find out the crawl name"
-# curl -O https://index.commoncrawl.org/collinfo.json
-#
-# CC-MAIN-2024-22.warc.paths.gz:
-# @echo "downloading the list from s3, requires s3 auth even though it is free"
-# @echo "note that this file should be in the repo"
-# aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
-#
+download_collinfo:
+	@echo "downloading collinfo.json so we can find out the crawl name"
+	curl -O https://index.commoncrawl.org/collinfo.json
+
+CC-MAIN-2024-22.warc.paths.gz:
+	@echo "downloading the list from s3, requires s3 auth even though it is free"
+	@echo "note that this file should be in the repo"
+	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
+
 # duck_local_files:
 # @echo "warning! 300 gigabyte download"
 # python duck.py local_files
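
For readers who want to see what the two re-enabled targets do with their data, here is a rough Python sketch of the same steps. It is not part of this commit. It assumes collinfo.json is a JSON array of objects that each carry an "id" field such as CC-MAIN-2024-22, and that `aws s3 ls` prints date, time, size, and name columns, which is why the recipe keeps field 4 (written `$$4` in the Makefile so that Make passes a literal `$4` to the shell). The sample strings and all function names below are made up for illustration.

```python
import gzip
import json

# Hypothetical sample shaped like collinfo.json (assumption: a JSON array of
# objects with an "id" field; the real file has more fields and more entries).
SAMPLE_COLLINFO = """[
  {"id": "CC-MAIN-2024-26", "name": "illustrative entry A"},
  {"id": "CC-MAIN-2024-22", "name": "illustrative entry B"}
]"""

# Hypothetical sample of `aws s3 ls` output: date, time, size, name.
SAMPLE_LISTING = """\
2024-06-10 01:02:03  123456789 part-00000-example.gz.parquet
2024-06-10 01:02:04  123456790 part-00001-example.gz.parquet
"""


def crawl_ids(collinfo_text):
    """Return the crawl ids listed in a collinfo.json document."""
    return [entry["id"] for entry in json.loads(collinfo_text)]


def fourth_column(listing_text):
    """Mimic awk '{print $4}': keep the 4th whitespace-separated field.

    Unlike awk, lines with fewer than four fields are skipped rather than
    printed as empty lines.
    """
    names = []
    for line in listing_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            names.append(fields[3])
    return names


def write_paths_gz(names, path):
    """Write one name per line at maximum compression, like `gzip -9`."""
    with gzip.open(path, "wt", encoding="utf-8", compresslevel=9) as fh:
        for name in names:
            fh.write(name + "\n")


if __name__ == "__main__":
    print(crawl_ids(SAMPLE_COLLINFO))
    print(fourth_column(SAMPLE_LISTING))
```

The sketch works on embedded sample text so it runs offline; the real targets fetch collinfo.json with curl and need S3 credentials for the listing.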
