README.md: 4 additions & 3 deletions
@@ -17,6 +17,7 @@ flowchart TD
The goal of this whirlwind tour is to show you how a single webpage appears in all of these different places. That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete), which we crawled at 2024-05-18T01:58:10Z. On the way, we'll also explore the file formats we use and learn about some useful tools for interacting with our data!
In the Whirlwind Tour, we will:
+
1) explore the WARC, WET and WAT file formats used to store Common Crawl's data.
2) play with some useful Python packages for interacting with the data: [warcio](https://github.com/webrecorder/warcio), [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer),
@@ -175,7 +176,7 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one
## Task 3: Index the WARC, WET, and WAT
-The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
+The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
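To make that random access concrete, here is a minimal sketch of pulling a single record out of a full-size WARC with an HTTP Range request, then parsing the returned bytes with warcio. The filename, offset, and length are hypothetical placeholders standing in for values you would get from an index lookup:

```python
import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder values standing in for a real index entry.
filename = 'crawl-data/CC-MAIN-2024-22/segments/example/warc/example.warc.gz'
offset, length = 1260, 10382

# Ask the server for just the bytes of this one record.
resp = requests.get(
    'https://data.commoncrawl.org/' + filename,
    headers={'Range': f'bytes={offset}-{offset + length - 1}'},
)

# Each gzipped WARC record is a complete gzip member, so warcio can
# parse the fetched slice on its own.
for record in ArchiveIterator(io.BytesIO(resp.content)):
    print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
```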
-## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
+## Task 6: Use cdx_toolkit to query the full CDX index and download those captures
Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or the columnar index, which we'll talk about later.
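As a minimal sketch of such a query, here is the cdx_toolkit Python API looking up captures of our example page in the CDX index; the printed fields are standard CDX fields:

```python
import cdx_toolkit

# source='cc' points the fetcher at Common Crawl's CDX index.
cdx = cdx_toolkit.CDXFetcher(source='cc')

# Iterate over up to five captures of the tour's example page.
for capture in cdx.iter('an.wikipedia.org/wiki/Escopete', limit=5):
    print(capture['timestamp'], capture['status'], capture['url'])
```

The `cdxt` command used in this task is the command-line front end to this same package.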
@@ -400,7 +401,7 @@ Next, we use the `cdxt` command `warc` to retrieve the content and save it local
Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
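For reference, each line in the CDXJ index that cdxj-indexer writes pairs a SURT-sorted key and a timestamp with a JSON blob; the line below is a made-up example showing how to pull out the fields that random access needs:

```python
import json

# A hypothetical CDXJ line of the kind cdxj-indexer emits.
line = ('org,wikipedia,an)/wiki/escopete 20240518015810 '
        '{"url": "https://an.wikipedia.org/wiki/Escopete", '
        '"offset": "1260", "length": "10382", "filename": "example.warc.gz"}')

# The key and timestamp are space-separated; everything after is JSON.
surt_key, timestamp, blob = line.split(' ', 2)
fields = json.loads(blob)
print(surt_key, timestamp, fields['filename'], fields['offset'], fields['length'])
```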
-## Task 7: Find the right part of the columnar index
+## Task 7: Find the right part of the columnar index
Now let's look at the columnar index, the other kind of index that Common Crawl makes available. This index is stored in Parquet files, so you can access it using SQL-based tools like AWS Athena and duckdb, as well as through tables in your favorite table packages such as pandas, pyarrow, and polars.
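As a minimal sketch, assuming your duckdb build can read the public s3://commoncrawl bucket anonymously, a query against one crawl's partition of the columnar index might look like this (the crawl label is just an example, and the columns follow the published cc-index-table schema):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # region of the commoncrawl bucket

# Each crawl is a separate partition; CC-MAIN-2024-22 is an example label.
rows = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/*.parquet')
    WHERE url_host_name = 'an.wikipedia.org'
    LIMIT 5
""").fetchall()

for row in rows:
    print(row)
```

The warc_filename, warc_record_offset, and warc_record_length columns give you exactly the values the Range-request trick from Task 3 needs.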