
Commit 0edfbc2

committed: remove non-implemented part - to avoid confusion
1 parent 6a4a124

1 file changed: README.md
Lines changed: 5 additions & 244 deletions
@@ -194,80 +194,11 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one
## Task 3: Index the WARC, WET, and WAT

The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them we could simply iterate, but what if we want random access so we can read just one particular record? For that we need an index.

```mermaid
flowchart LR
    warc --> indexer --> cdxj & columnar
    warc@{ shape: cyl }
    cdxj@{ shape: stored-data }
    columnar@{ shape: stored-data }
```
We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.

### CDX(J) index

**TBA**: We did not find a good Java library that implements this feature; ideally it could be implemented in jwarc.

The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files, since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour, for legacy reasons 💅
We can create our own CDXJ index from the local WARCs by running:

```make cdxj```

This uses the [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer) library to generate CDXJ index files for our WARC files by running the code below:

<details>
<summary>Click to view code</summary>

```
creating *.cdxj index files from the local warcs
cdxj-indexer whirlwind.warc.gz > whirlwind.warc.cdxj
cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdxj
cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
```

</details>

Now look at the `.cdxj` files with `cat whirlwind*.cdxj`. You'll see that each file has one entry in the index. The WARC only has its response record indexed, since by default cdxj-indexer guesses that you won't ever want to random-access the request or metadata records. The WET and WAT have their conversion and metadata records indexed (Common Crawl doesn't publish a WET or WAT index, just WARC).
For each of these records there's one text line in the index - yes, it's a flat file! Each line starts with a string like `org,wikipedia,an)/wiki/escopete 20240518015810`, followed by a JSON blob. The starting string is the primary key of the index. The first part is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform) of the URL. The big integer is a date, in ISO-8601 format with the delimiters removed.

What's the purpose of this funky format? These flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility, e.g. the standard Linux `sort`, or one of the Hadoop-based out-of-core sort functions.
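To make the key format concrete, here is a minimal sketch of the transform. This is not the canonical implementation (real SURT code also handles ports, `www.` prefixes, query strings, and more); it only shows the core idea of reversing the host labels so that captures from the same domain sort together:

```python
from urllib.parse import urlsplit

def simple_surt(url):
    """Very simplified SURT: reverse the host labels, join them with
    commas, then append ')' plus the path, all lower-cased."""
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"{host}){parts.path}".lower()

print(simple_surt("https://an.wikipedia.org/wiki/Escopete"))
# org,wikipedia,an)/wiki/escopete
```

Because the reversed host comes first, `an.wikipedia.org` and `en.wikipedia.org` end up near each other in a sorted index, which is exactly what makes out-of-core sorting and range scans useful.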
The JSON blob has enough information to cleanly isolate the raw data of a single record: it defines which WARC file the record is in, and the byte offset and length of the record within this file. We'll use that in the next section.
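Since each index line is just a primary key followed by a JSON blob, parsing it needs nothing beyond the standard library. A minimal sketch, using a CDXJ line of the same shape as the ones in this tour:

```python
import json

# a CDXJ line: SURT key, timestamp, then a JSON blob
line = ('org,wikipedia,an)/wiki/escopete 20240518015810 '
        '{"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", '
        '"status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", '
        '"length": "17455", "offset": "406", '
        '"filename": "TEST-000000.extracted.warc.gz"}')

# split on the first two spaces; the JSON blob may itself contain spaces
surt, timestamp, blob = line.split(" ", 2)
record = json.loads(blob)

# the three fields needed to isolate the raw record
print(record["filename"], record["offset"], record["length"])
```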
TBA

## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT

Normally, compressed files aren't random access. The WARC files use a trick to make random access possible: every record is separately compressed. The `gzip` compression utility supports concatenated members like this, although the feature is rarely used.

To extract one record from a WARC file, all you need to know is the filename and the offset into the file. If you're reading over the web, it also really helps to know the exact length of the record.
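To see why per-record compression enables random access, here is a sketch under the assumption that each record is its own gzip member: seeking to the member's byte offset and decompressing `length` bytes yields exactly one record, with no need to read anything before it.

```python
import gzip

def read_member(path, offset, length):
    """Read one independently compressed gzip member out of a larger
    file, given its byte offset and compressed length."""
    with open(path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))
```

This is the same trick the extraction tools below rely on; the offset and length come straight from the CDXJ index.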
Run:

```make extract```

to run a set of extractions from your local `whirlwind.*.gz` files with `warcio` using the code below:
<details>
<summary>Click to view code</summary>

```
creating extraction.* from local warcs, the offset numbers are from the cdxj index
warcio extract --payload whirlwind.warc.gz 1023 > extraction.html
warcio extract --payload whirlwind.warc.wet.gz 466 > extraction.txt
warcio extract --payload whirlwind.warc.wat.gz 443 > extraction.json
hint: python -m json.tool extraction.json
```

</details>
The offset numbers in the Makefile are the same ones as in the index. Look at the three output files: `extraction.html`, `extraction.txt`, and `extraction.json` (pretty-print the JSON with `python -m json.tool extraction.json`).

Notice that we extracted HTML from the WARC, text from the WET, and JSON from the WAT (as reflected in the different file extensions). This is because the payload of each file type is formatted differently!

TBA

## Task 5: Wreck the WARC by compressing it wrong

@@ -337,63 +268,7 @@ Make sure you compress WARCs the right way!
## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3

Some of our users only want to download a small subset of the crawl. They want to run queries against an index: either the CDX index we just talked about, or the columnar index, which we'll talk about later.

The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls, and it can also create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour.

Run

```make cdx_toolkit```

The output looks like this:
<details>
<summary>Click to view output</summary>

```
demonstrate that we have this entry in the index
cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
status 200, timestamp 20240518015810, url https://an.wikipedia.org/wiki/Escopete

cleanup previous work
rm -f TEST-000000.extracted.warc.gz

retrieve the content from the commoncrawl s3 bucket
cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete

index this new warc
cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
cat TEST-000000.extracted.warc.cdxj
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"}

iterate this new warc
python ./warcio-iterator.py TEST-000000.extracted.warc.gz
WARC-Type: warcinfo
WARC-Type: response
WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
```

</details>
There's a lot going on here, so let's unpack it a little.

#### Check that the crawl has a record for the page we are interested in

We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the timestamp range `--from 20240518015810 --to 20240518015810`. The result tells us that the crawl successfully fetched this page at timestamp `20240518015810`.

* Captures are named by the surtkey and the time.
* Instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.
* You can pass `--limit <N>` to limit the number of results returned. In this case, because we have restricted the timestamp range to a single value, we only expect one result.
* URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches both `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.
#### Retrieve the fetched content as WARC

Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz`, which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.

* If you dig into cdx_toolkit's code, you'll find that it uses the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte-range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
* By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file, `TEST-000001.extracted.warc.gz`.
* The limit, timestamp, and crawl index arguments, as well as URL wildcards, work just as they do for `iter`.
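The byte-range trick in the first bullet can be sketched in a few lines of standard-library Python. This is a hypothetical illustration, not cdx_toolkit's actual code; it assumes Common Crawl's public HTTP endpoint at `https://data.commoncrawl.org/` and takes the filename, offset, and length from a CDX query result:

```python
import gzip
import urllib.request

# assumed public HTTP endpoint in front of the commoncrawl bucket
CC_DATA = "https://data.commoncrawl.org/"

def range_request(filename, offset, length):
    """Build an HTTP request for just the bytes of one WARC record.
    Range header bounds are inclusive, hence the -1."""
    last = offset + length - 1
    return urllib.request.Request(
        CC_DATA + filename, headers={"Range": f"bytes={offset}-{last}"})

def fetch_record(filename, offset, length):
    """Download one record and decompress its single gzip member."""
    with urllib.request.urlopen(range_request(filename, offset, length)) as r:
        return gzip.decompress(r.read())
```

The server returns only the requested slice of the gigabyte-sized WARC, which is why fetching one page doesn't require downloading the whole file.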
### Indexing the WARC and viewing its contents

Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it, as in Task 3, and then iterate over the WARC using `warcio-iterator.py`, as in Task 2.

TBA

## Task 7: Find the right part of the columnar index

@@ -431,125 +306,11 @@ The date of our test record is 20240518015810, which is
## Task 8: Query using the columnar index + DuckDB from outside AWS

A single crawl's columnar index is around 300 gigabytes. If you don't have a lot of disk space, but you do have a lot of time, you can directly access the index stored on AWS S3. We're going to do just that, and then use [DuckDB](https://duckdb.org) to make an SQL query against the index to find our webpage. We'll be running the following query:

```sql
SELECT
  *
FROM ccindex
WHERE subset = 'warc'
  AND crawl = 'CC-MAIN-2024-22'
  AND url_host_tld = 'org' -- help the query optimizer
  AND url_host_registered_domain = 'wikipedia.org' -- ditto
  AND url = 'https://an.wikipedia.org/wiki/Escopete'
;
```

Run

```make duck_cloudfront```

On a machine with a 1 gigabit network connection and many cores, this should take about one minute total and uses 8 cores. The output should look like:
<details>
<summary>Click to view output</summary>

```
warning! this might take 1-10 minutes
python duck.py cloudfront
total records for crawl: CC-MAIN-2024-22
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│   2709877975 │
└──────────────┘

our one row
┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
│     url_surtkey      │         url          │  url_host_name   │ … │   warc_segment   │      crawl      │ subset  │
│       varchar        │       varchar        │     varchar      │   │     varchar      │     varchar     │ varchar │
├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
│ org,wikipedia,an)/…  │ https://an.wikiped…  │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc    │
├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
│ 1 rows                                                                                       32 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

writing our one row to a local parquet file, whirlwind.parquet
total records for local whirlwind.parquet should be 1
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            1 │
└──────────────┘

our one row, locally
┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
│     url_surtkey      │         url          │  url_host_name   │ … │   warc_segment   │      crawl      │ subset  │
│       varchar        │       varchar        │     varchar      │   │     varchar      │     varchar     │ varchar │
├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
│ org,wikipedia,an)/…  │ https://an.wikiped…  │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc    │
├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
│ 1 rows                                                                                       32 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

complete row:
url_surtkey                 org,wikipedia,an)/wiki/escopete
url                         https://an.wikipedia.org/wiki/Escopete
url_host_name               an.wikipedia.org
url_host_tld                org
url_host_2nd_last_part      wikipedia
url_host_3rd_last_part      an
url_host_4th_last_part      None
url_host_5th_last_part      None
url_host_registry_suffix    org
url_host_registered_domain  wikipedia.org
url_host_private_suffix     org
url_host_private_domain     wikipedia.org
url_host_name_reversed      org.wikipedia.an
url_protocol                https
url_port                    nan
url_path                    /wiki/Escopete
url_query                   None
fetch_time                  2024-05-18 01:58:10+00:00
fetch_status                200
fetch_redirect              None
content_digest              RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
content_mime_type           text/html
content_mime_detected       text/html
content_charset             UTF-8
content_languages           spa
content_truncated           None
warc_filename               crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz
warc_record_offset          80610731
warc_record_length          17423
warc_segment                1715971057216.39
crawl                       CC-MAIN-2024-22
subset                      warc

equivalent to cdxj:
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "80610731", "filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
```

</details>
The above command runs code in `duck.py`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code then runs the SQL query we saw before, which should match the single response record we want.

The program then writes that one record into a local Parquet file, runs a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns holding different information associated with our record. Finally, it converts the row to the CDXJ format we saw before.

TBA
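That final conversion from a columnar row back to a CDXJ line is easy to sketch in plain Python. This is an illustration rather than `duck.py`'s actual code; the field names are the columnar index columns shown above, and the CDXJ timestamp is just the fetch time with its ISO-8601 delimiters removed:

```python
import json

def row_to_cdxj(row):
    """Render one columnar-index row (a dict of column name -> value)
    as a CDXJ line: surtkey, timestamp, then a JSON blob."""
    # strip the ISO-8601 delimiters from e.g. "2024-05-18 01:58:10+00:00"
    timestamp = "".join(c for c in str(row["fetch_time"]) if c.isdigit())[:14]
    blob = {
        "url": row["url"],
        "mime": row["content_mime_type"],
        "status": str(row["fetch_status"]),
        "digest": "sha1:" + row["content_digest"],
        "length": str(row["warc_record_length"]),
        "offset": str(row["warc_record_offset"]),
        "filename": row["warc_filename"],
    }
    return f'{row["url_surtkey"]} {timestamp} {json.dumps(blob)}'
```

Note that the two index formats carry overlapping information: the columnar row has everything the CDXJ line has, plus many extra columns.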

### Bonus: download a full crawl index and query with DuckDB

If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run

```make duck_local_files```

If the files aren't already downloaded, this command will give you download instructions.

(**Bonus bonus:** If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```.)

All of these scripts run the same SQL query and should return the same record (written as a Parquet file).

TBA

## Bonus 2: combine some steps
