Commit ec39f21

feat: add download instructions
1 parent b0c35f3 commit ec39f21

1 file changed

Lines changed: 16 additions & 2 deletions

README.md

@@ -796,10 +796,24 @@ If you want to run many of these queries, and you have a lot of disk space, you'
 aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .'
 ```
 
-or (if you don't have access through the AWS CLI):
+or (if you don't have access through the AWS CLI):
 
 ```shell
-TBA - add the download instructions
+mkdir -p cc-main-2024-22
+cd cc-main-2024-22
+
+wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
+gunzip cc-index-table.paths.gz
+
+grep 'subset=warc' cc-index-table.paths | \
+awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
+xargs -n 2 -P 10 sh -c '
+echo "Downloading: $2"
+mkdir -p "$(dirname "$2")" &&
+wget -O "$2" "$1"
+' _
+
+cd -
 ```
 
 then you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
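
The parallel `xargs` pipeline above keeps going when an individual `wget` fails, and `wget -O` may leave an empty file behind in that case, so it can be worth checking the result before pointing `LOCAL_DIR` at it. A minimal sketch of such a check follows; the `missing_paths` helper is hypothetical (it is not part of this repository or of the commit) and assumes it is run from the directory that holds `cc-index-table.paths`:

```shell
# Hypothetical helper, not part of the repository: print every
# 'subset=warc' entry of a paths file that is missing or empty on disk.
# Assumes the paths in the file are relative to the current directory,
# which is how the download snippet above lays them out.
missing_paths() {
    paths_file=$1
    grep 'subset=warc' "$paths_file" | while IFS= read -r path; do
        # -s is true only for files that exist and are non-empty
        [ -s "$path" ] || printf '%s\n' "$path"
    done
}
```

Running `missing_paths cc-index-table.paths` inside `cc-main-2024-22` prints nothing when every listed file is present and non-empty; any path it does print is a candidate for re-downloading.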
