The output has three sections, one each for the WARC, WET, and WAT.
## Task 3: Index the WARC, WET, and WAT
The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
```mermaid
flowchart LR
    warc --> indexer --> cdxj & columnar
    warc@{ shape: cyl }
    cdxj@{ shape: stored-data }
    columnar@{ shape: stored-data }
```
We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.
### CDX(J) index
**TBA**: We did not find a good Java library that implements this feature; ideally it could be implemented in jwarc.
The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour for legacy reasons 💅
We can create our own CDXJ index from the local WARCs by running:
```make cdxj```
This uses the [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer) library to generate CDXJ index files for our WARC files.
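The exact code the make target runs isn't shown in this excerpt. As a minimal sketch (assuming cdxj-indexer's default behavior of writing CDXJ lines to stdout, and the `whirlwind.*.gz` file names used throughout this tour), it does something equivalent to:

```python
import subprocess

# index each local file; cdxj-indexer writes CDXJ lines to stdout by default
for name in ['whirlwind.warc.gz', 'whirlwind.wet.gz', 'whirlwind.wat.gz']:
    out_name = name.removesuffix('.gz') + '.cdxj'   # e.g. whirlwind.warc.cdxj
    with open(out_name, 'w') as out:
        subprocess.run(['cdxj-indexer', name], stdout=out, check=True)
```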
Now look at the `.cdxj` files with `cat whirlwind*.cdxj`. You'll see that each file has one entry in the index. The WARC only has the response record indexed, since by default cdxj-indexer guesses that you won't ever want to random-access the request or metadata. WET and WAT have the conversion and metadata records indexed (Common Crawl doesn't publish a WET or WAT index, just WARC).
For each of these records, there's one text line in the index - yes, it's a flat file! It starts with a string like `org,wikipedia,an)/wiki/escopete 20240518015810`, followed by a JSON blob. The starting string is the primary key of the index. The first part is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform) of the URL. The big integer is a timestamp, in ISO-8601 format with the delimiters removed.
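To see how a URL becomes a SURT, you can play with the `surt` Python package (a sketch; the indexer's own SURT implementation may differ in details):

```python
from surt import surt

# SURT reverses the hostname so that pages from related hosts sort together;
# canonicalization also lowercases the URL
key = surt('https://an.wikipedia.org/wiki/Escopete')
print(key)                          # org,wikipedia,an)/wiki/escopete

# appending the 14-digit timestamp gives the index's primary key
print(key + ' 20240518015810')
```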
What is the purpose of this funky format? It's done this way because these flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility, e.g. the standard Linux `sort` or one of the Hadoop-based out-of-core sort functions.
The JSON blob has enough information to cleanly isolate the raw data of a single record: it defines which WARC file the record is in, and the byte offset and length of the record within this file. We'll use that in the next section.
## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT
Normally, compressed files aren't random access. However, WARC files use a trick to make this possible: every record is compressed separately. The `gzip` format supports this (concatenated members), but the feature is rarely used.
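Here's a small illustration of the trick, a sketch using Python's `gzip` module (real WARC records are of course full HTTP captures, not toy strings):

```python
import gzip

# compress two "records" separately and concatenate them, as WARC writers do
rec1 = gzip.compress(b'record one')
rec2 = gzip.compress(b'record two')
blob = rec1 + rec2

# gzip tools read straight through member boundaries...
print(gzip.decompress(blob))            # b'record onerecord two'

# ...but if we know a record's byte offset, we can decompress from that
# member onward, without touching anything before it
offset = len(rec1)
print(gzip.decompress(blob[offset:]))   # b'record two'
```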
To extract one record from a WARC file, all you need to know is the filename and the offset into the file. If you're reading over the web, it also really helps to know the exact length of the record.
Run:

```make extract```

to run a set of extractions from your local `whirlwind.*.gz` files with `warcio` using the code below:
<details>
<summary>Click to view code</summary>

```
creating extraction.* from local warcs, the offset numbers are from the cdxj index
...
```

</details>

The offsets used are the same ones as in the index. Look at the three output files: `extraction.html`, `extraction.txt`, and `extraction.json` (pretty-print the json with `python -m json.tool extraction.json`).
Notice that we extracted HTML from the WARC, text from WET, and JSON from the WAT (as shown in the different file extensions). This is because the payload in each file type is formatted differently!
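The extraction code itself is elided above, but its core looks roughly like this (a sketch with made-up offset and length; real values come from the CDXJ index entry):

```python
from warcio.archiveiterator import ArchiveIterator

offset, length = 839, 1201   # hypothetical values from a CDXJ index entry

with open('whirlwind.warc.gz', 'rb') as stream:
    stream.seek(offset)      # jump straight to the record's own gzip member
    record = next(iter(ArchiveIterator(stream)))
    payload = record.content_stream().read()

# for a WARC response record, the payload is the HTML of the page
with open('extraction.html', 'wb') as out:
    out.write(payload)
```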
## Task 5: Wreck the WARC by compressing it wrong
Make sure you compress WARCs the right way!
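The body of this task is omitted from this excerpt, but the gist is easy to sketch (assuming the `whirlwind.warc.gz` file from earlier tasks): recompressing a WARC as a single gzip stream destroys the per-record gzip members, so the offsets stored in the index no longer point at anything decompressable.

```python
import gzip

# the WRONG way: squash all the separate gzip members into one big stream
with open('whirlwind.warc.gz', 'rb') as f:
    data = gzip.decompress(f.read())   # reads through all members
with open('wrecked.warc.gz', 'wb') as f:
    f.write(gzip.compress(data))       # one giant member: offsets now useless

# warcio ships a fixer that rewrites each record as its own gzip member:
#     warcio recompress wrecked.warc.gz fixed.warc.gz
```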
## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about or the columnar index, which we'll talk about later.
The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls and also can create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour.
Run
```make cdx_toolkit```
The output looks like this:
<details>
<summary>Click to view output</summary>

```
demonstrate that we have this entry in the index
cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
status 200, timestamp 20240518015810, url https://an.wikipedia.org/wiki/Escopete

cleanup previous work
rm -f TEST-000000.extracted.warc.gz

retrieve the content from the commoncrawl s3 bucket
...
```

</details>

There's a lot going on here so let's unpack it a little.
#### Check that the crawl has a record for the page we are interested in
We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the timestamp range `--from 20240518015810 --to 20240518015810`. The result tells us that the crawl successfully fetched this page at timestamp `20240518015810` (a Python version of this query appears after the list below).
* Captures are named by the surtkey and the time.
* Instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.
* You can pass `--limit <N>` to limit the number of results returned - in this case because we have restricted the timestamp range to a single value, we only expect one result.
* URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.
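If you'd rather run this query from Python than from the `cdxt` command line, cdx_toolkit exposes the same iteration as an API. A minimal sketch, using the parameter names from cdx_toolkit's documented usage:

```python
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')   # 'cc' = Common Crawl's CDX index

# same query as the cdxt iter command above
for obj in cdx.iter('an.wikipedia.org/wiki/Escopete',
                    from_ts='20240518015810', to='20240518015810', limit=1):
    print(obj['status'], obj['timestamp'], obj['url'])
```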
#### Retrieve the fetched content as WARC
Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
* If you dig into cdx_toolkit's code, you'll find that it uses the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte range request to S3 that isolates and returns just the single record we want from the full file; see the sketch after this list. It only downloads the response WARC record because our CDX index only has the response records indexed.
* By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
* Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
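Under the hood, that byte range trick is plain HTTP. A sketch of the idea (the filename, offset, and length below are placeholders; real values come from the CDX query result):

```python
import requests

# placeholder values; a real CDX result carries filename, offset, and length
filename = 'crawl-data/CC-MAIN-2024-22/segments/.../warc/...warc.gz'
offset, length = 1000, 2000

resp = requests.get('https://data.commoncrawl.org/' + filename,
                    headers={'Range': f'bytes={offset}-{offset + length - 1}'})
assert resp.status_code == 206   # Partial Content
record_bytes = resp.content     # one self-contained gzip member
```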
### Indexing the WARC and viewing its contents
Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
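For reference, iterating a WARC with warcio looks like this (a sketch along the lines of what `warcio-iterator.py` presumably does):

```python
from warcio.archiveiterator import ArchiveIterator

with open('TEST-000000.extracted.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # print each record's type and the URL it refers to
        print(record.rec_type,
              record.rec_headers.get_header('WARC-Target-URI'))
```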
## Task 7: Find the right part of the columnar index
The date of our test record is 20240518015810, which is 2024-05-18 01:58:10.
## Task 8: Query using the columnar index + DuckDB from outside AWS
A single crawl columnar index is around 300 gigabytes. If you don't have a lot of disk space, but you do have a lot of time, you can directly access the index stored on AWS S3. We're going to do just that, and then use [DuckDB](https://duckdb.org) to make an SQL query against the index to find our webpage. We'll be running the following query:
```sql
SELECT
  *
FROM ccindex
WHERE subset = 'warc'
  AND crawl = 'CC-MAIN-2024-22'
  AND url_host_tld = 'org'  -- help the query optimizer
  AND url_host_registered_domain = 'wikipedia.org'  -- ditto
  AND url = 'https://an.wikipedia.org/wiki/Escopete'
;
```
Run
```make duck_cloudfront```
On a machine with a 1 gigabit network connection and many cores, this should take about one minute total, using 8 cores.
The above command runs code in `duck.py`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code then runs the SQL query we saw before, which should match the single response record we want.
The program then writes that one record into a local Parquet file, does a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns containing different information associated with our record. Finally, it converts the row to the CDXJ format we saw before.
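A sketch of the kind of code `duck.py` runs. The S3 path shown is an assumption based on the columnar index's public hive-partitioned layout, and the real script may differ:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")   # commoncrawl is a public bucket

# assumed layout: hive-partitioned by crawl and subset
path = ('s3://commoncrawl/cc-index/table/cc-main/warc/'
        'crawl=CC-MAIN-2024-22/subset=warc/*.gz.parquet')
con.execute(f"""
    CREATE VIEW ccindex AS
    SELECT * FROM read_parquet('{path}', hive_partitioning = true)
""")

row = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE url = 'https://an.wikipedia.org/wiki/Escopete'
""").fetchone()
print(row)   # enough information to byte-range fetch the record, as in Task 6
```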
### Bonus: download a full crawl index and query with DuckDB
If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
```make duck_local_files```
If the files aren't already downloaded, this command will give you download instructions.
(**Bonus bonus:** If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```.)
All of these scripts run the same SQL query and should return the same record (written as a Parquet file).