Commit a91e52f

chore(docu): minor changes

1 parent e38a0ce, commit a91e52f
2 files changed: 8 additions & 15 deletions

Makefile (0 additions & 4 deletions)

@@ -40,10 +40,6 @@ CC-MAIN-2024-22.warc.paths.gz:
 	@echo "note that this file should be in the repo"
 	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
 
-# duck_local_files:
-#	@echo "warning! 300 gigabyte download"
-#	python duck.py local_files
-#
 duck_ccf_local_files: build
 	@echo "warning! only works on Common Crawl Foundation's development machine"
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"
README.md (8 additions & 11 deletions)

@@ -337,11 +337,9 @@ Make sure you compress WARCs the right way!
 
 Some of our users only want to download a small subset of the crawl. They want to run queries against an index: either the CDX index we just talked about, or the columnar index, which we'll talk about later.
 
-The CDX index is documented [here](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference) and can be accessed through a HTTP API.
+The CDX server API is documented [here](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference) and can be accessed through an HTTP API.
 
-There is a complete python package called
-
-Right now there is no specific tool in Java for query the CDX index, nevertheless, we do have a very useful Python tool for working with the CDX index: [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit). Please refer to the Python Whirlwind Tour for more details.
+Right now there is no Java-specific tool for querying the CDX index; nevertheless, we do have a very useful Python tool for working with it: [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit). Please refer to the [Python Whirlwind Tour](https://github.com/commoncrawl/whirlwind-python) for more details.
 
 In this task we will achieve the same results using direct HTTP API calls and JWARC.
 
@@ -526,18 +524,17 @@ org,wikipedia,an)/wiki/escopete 20240518015810 {"url":"https://an.wikipedia.org/
 ```
 </details>
 
-The above command runs code in `duck.py`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code runs the SQL query we saw before, which should match the single response record we want.
+The above command runs code in `Duck.java`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code runs the SQL query we saw before, which should match the single response record we want.
 
 The program then writes that one record into a local Parquet file, runs a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns with different information associated with our record. Finally, it converts the row to the CDXJ format we saw before.
 
 ### Bonus: download a full crawl index and query with DuckDB
 
 If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
 
-```make duck_local_files```
-
-If the files aren't already downloaded, this command will give you
-download instructions.
+```shell
+aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .
+```
 
 (**Bonus bonus:** If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```.)
 
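Once the index files are synced locally, the counting step can be reproduced with a DuckDB query over the Parquet files. This is a sketch, not the commit's own code: it assumes the `duckdb_jdbc` driver is on the classpath, and the `*.parquet` glob is illustrative (point it at the directory used with `aws s3 sync`):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LocalIndexCount {
    // Build the counting query over locally synced Parquet index files.
    // The glob is illustrative; substitute your local sync directory.
    static String countQuery(String parquetGlob) {
        return "SELECT COUNT(*) FROM parquet_scan('" + parquetGlob + "')";
    }

    public static void main(String[] args) throws Exception {
        // In-memory DuckDB database; requires the duckdb_jdbc dependency at runtime.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(countQuery("*.parquet"))) {
            if (rs.next()) {
                System.out.println("records: " + rs.getLong(1));
            }
        }
    }
}
```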

@@ -547,12 +544,12 @@ All of these scripts run the same SQL query and should return the same record (w
 
 1. Use the DuckDB techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives.
 2. Note its url, warc, and timestamp.
-3. Now open up the Makefile from [Task 6](#task-6-use-cdx_toolkit-to-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the cdx_toolkit section.
+3. Now open up the Makefile from [Task 6](#task-6-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the cdx_toolkit section.
 4. Repeat the cdx_toolkit steps, but for the page and date range you found above.
 
 ## Congratulations!
 
-You have completed the Whirlwind Tour of Common Crawl's Datasets using Python! You should now understand different filetypes we have in our corpus and how to interact with Common Crawl's datasets using Python. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab?
+You have completed the Whirlwind Tour of Common Crawl's Datasets using Java! You should now understand the different filetypes we have in our corpus and how to interact with Common Crawl's datasets using Java. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab?
 
 
 ## Other datasets
