
Commit e38a0ce (parent: a716f7d)

feat: task 6

2 files changed: 50 additions & 37 deletions

Makefile: 17 additions & 17 deletions
````diff
@@ -14,23 +14,23 @@ extract:
 	java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
 	@echo "hint: python -m json.tool extraction.json"
 
-# cdx_toolkit:
-# 	@echo demonstrate that we have this entry in the index
-# 	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
-# 	@echo
-# 	@echo cleanup previous work
-# 	rm -f TEST-000000.extracted.warc.gz
-# 	@echo retrieve the content from the commoncrawl s3 bucket
-# 	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
-# 	@echo
-# 	@echo index this new warc
-# 	cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
-# 	cat TEST-000000.extracted.warc.cdxj
-# 	@echo
-# 	@echo iterate this new warc
-# 	python ./warcio-iterator.py TEST-000000.extracted.warc.gz
-# 	@echo
-#
+query_cdx:
+	@echo demonstrate that we have this entry in the index
+	curl 'https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete&output=json&from=20240518015810&to=20240518015810'
+	@echo
+	@echo cleanup previous work
+	rm -f TEST-000000.extracted.warc.gz
+	@echo retrieve the content from the commoncrawl data server
+	curl --request GET --url 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz' --header 'Range: bytes=80610731-80628153' > TEST-000000.extracted.warc.gz
+	@echo
+	@echo index this new warc
+	java -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
+	cat TEST-000000.extracted.warc.cdxj
+	@echo
+	@echo iterate this new warc
+	java -jar jwarc.jar ls TEST-000000.extracted.warc.gz
+	@echo
+
 download_collinfo:
 	@echo "downloading collinfo.json so we can find out the crawl name"
 	curl -O https://index.commoncrawl.org/collinfo.json
````
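The `Range` header in the new curl call comes straight from the `offset` and `length` fields of the CDX index entry; the last byte of the range is `offset + length - 1`. A minimal sketch of that arithmetic in Java (the class and variable names are illustrative, not part of this repository):

```java
public class RangeHeader {
    public static void main(String[] args) {
        long offset = 80610731L; // "offset" field from the CDX index entry
        long length = 17423L;    // "length" field from the CDX index entry
        // HTTP byte ranges are inclusive on both ends, hence the -1.
        System.out.println("Range: bytes=" + offset + "-" + (offset + length - 1));
        // prints: Range: bytes=80610731-80628153, the range used in the Makefile
    }
}
```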

README.md: 33 additions & 20 deletions
````diff
@@ -333,15 +333,21 @@ mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args
 
 Make sure you compress WARCs the right way!
 
-## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
+## Task 6: Query the full CDX index and download those captures from the Common Crawl data server
 
 Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or the columnar index, which we'll talk about later.
 
-The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls and also can create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour.
+The CDX index is documented [here](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference) and can be accessed through an HTTP API.
+
+There is no Java-specific tool for querying the CDX index yet, but there is a very useful Python package for working with it: [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit). Please refer to the Python Whirlwind Tour for more details.
+
+In this task we will achieve the same results using direct HTTP API calls and jwarc, fetching the same record from Wikipedia that we've been using for the whirlwind tour.
 
 Run
 
-```make cdx_toolkit```
+```make query_cdx```
 
 The output looks like this:
 
````

````diff
@@ -350,24 +356,25 @@ The output looks like this:
 
 ```
 demonstrate that we have this entry in the index
-cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
-status 200, timestamp 20240518015810, url https://an.wikipedia.org/wiki/Escopete
+curl 'https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete&output=json&from=20240518015810&to=20240518015810'
+
+{"urlkey": "org,wikipedia,an)/wiki/escopete", "timestamp": "20240518015810", "url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "80610731", "filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz", "languages": "spa", "encoding": "UTF-8"}
 
 cleanup previous work
 rm -f TEST-000000.extracted.warc.gz
-retrieve the content from the commoncrawl s3 bucket
-cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
+retrieve the content from the commoncrawl data server (end of range: 80628153 = 80610731 + 17423 - 1)
+curl --request GET \
+    --url https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz \
+    --header 'Range: bytes=80610731-80628153' > TEST-000000.extracted.warc.gz
 
 index this new warc
-cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
+java -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
 cat TEST-000000.extracted.warc.cdxj
-org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"}
+org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "0", "filename": "TEST-000000.extracted.warc.gz"}
 
 iterate this new warc
-python ./warcio-iterator.py TEST-000000.extracted.warc.gz
-WARC-Type: warcinfo
-WARC-Type: response
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+java -jar jwarc.jar ls TEST-000000.extracted.warc.gz
+0 response 200 https://an.wikipedia.org/wiki/Escopete
 ```
 
 </details>
````
````diff
@@ -376,22 +383,28 @@ There's a lot going on here so let's unpack it a little.
 
 #### Check that the crawl has a record for the page we are interested in
 
-We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the timestamp range `--from 20240518015810 --to 20240518015810`. The result of this tells us that the crawl successfuly fetched this page at timestamp `20240518015810`.
+We check for capture results by querying index.commoncrawl.org with GET parameters, specifying the crawl index (`CC-MAIN-2024-22-index`), the exact URL `an.wikipedia.org/wiki/Escopete`, and the timestamp range `from=20240518015810` to `to=20240518015810` (a minimal Java version of this query is sketched after this excerpt).
+The result tells us that the crawl successfully fetched this page at timestamp `20240518015810`.
 * Captures are named by the surtkey and the time.
-* Instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.
-* You can pass `--limit <N>` to limit the number of results returned - in this case because we have restricted the timestamp range to a single value, we only expect one result.
+* To search across all crawls, you have to query each crawl's index endpoint in turn; `collinfo.json` lists them all. For queries at that scale, the columnar index we'll talk about later is usually the better tool.
+* You can use the parameter `limit=<N>` to limit the number of results returned - in this case, because we have restricted the timestamp range to a single value, we only expect one result.
 * URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.
````
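As mentioned in the diff above, the index query is easy to reproduce in Java: the JDK's built-in HTTP client (Java 11+) needs no extra dependencies. A minimal sketch, assuming the same crawl, URL, and timestamps as the curl call (the class name is mine, not from this repository):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Stream;

public class QueryCdx {
    public static void main(String[] args) throws Exception {
        // Same query as the curl call: one crawl, one exact URL, one timestamp.
        String url = "https://index.commoncrawl.org/CC-MAIN-2024-22-index"
                + "?url=an.wikipedia.org/wiki/Escopete&output=json"
                + "&from=20240518015810&to=20240518015810";
        HttpResponse<Stream<String>> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofLines());
        // The index returns one JSON object per matching capture, one per line.
        response.body().forEach(System.out::println);
    }
}
```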

````diff
 #### Retrieve the fetched content as WARC
 
-Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
-* If you dig into cdx_toolkit's code, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make a HTTP byte range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
-* By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
-* Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
+Next, we make a second HTTP call to retrieve the content and save it locally as a new WARC file, this time using the `filename`, `offset`, and `length` fields returned by the index query.
+This creates the WARC file `TEST-000000.extracted.warc.gz`, which contains just the single `response` record we requested. Unlike cdx_toolkit, the raw byte-range download does not put a `warcinfo` record in front of it.
+* If you check the curl command, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte range request to `data.commoncrawl.org` that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed. A Java version of this request is sketched after this excerpt.
+* Limit, timestamp, and crawl index parameters, as well as URL wildcards, work here the same way as for the index query above.
````
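And the byte-range download from the first bullet above, again with the JDK HTTP client. A sketch that hard-codes the `offset` and `length` values from the index result; real code would parse them out of the JSON response:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class FetchRecord {
    public static void main(String[] args) throws Exception {
        long offset = 80610731L;  // "offset" field from the CDX entry
        long length = 17423L;     // "length" field from the CDX entry
        String warc = "https://data.commoncrawl.org"
                + "/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc"
                + "/CC-MAIN-20240517233122-20240518023122-00000.warc.gz";
        HttpRequest request = HttpRequest.newBuilder(URI.create(warc))
                // Byte ranges are inclusive: the last byte is offset + length - 1.
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .build();
        HttpResponse<Path> response = HttpClient.newHttpClient().send(request,
                HttpResponse.BodyHandlers.ofFile(Path.of("TEST-000000.extracted.warc.gz")));
        System.out.println("HTTP " + response.statusCode()); // expect 206 Partial Content
    }
}
```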

````diff
 ### Indexing the WARC and viewing its contents
 
-Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
+Finally, we run `jwarc cdxj` on this new WARC to make a CDXJ index of it as in Task 3, and then list its records using `jwarc ls` as in Task 2. A Java sketch of reading this WARC programmatically follows after this excerpt.
 
 ## Task 7: Find the right part of the columnar index
````
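For the programmatic read mentioned above: instead of shelling out to `jwarc ls`, jwarc's reader API can iterate the extracted WARC directly. A sketch along these lines, based on jwarc's documented reader usage (not code from this repository):

```java
import java.nio.file.Paths;
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

public class ListRecords {
    public static void main(String[] args) throws Exception {
        // Iterate every record in the freshly extracted WARC.
        try (WarcReader reader = new WarcReader(Paths.get("TEST-000000.extracted.warc.gz"))) {
            for (WarcRecord record : reader) {
                if (record instanceof WarcResponse) {
                    WarcResponse response = (WarcResponse) record;
                    // Print the HTTP status and target URI, similar to `jwarc ls`.
                    System.out.println(response.http().status() + " " + response.target());
                }
            }
        }
    }
}
```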
