@@ -33,7 +33,7 @@ Ready? Here we go!
 ## Look at the example files in an editor
 
 WARC files are a container that holds files, similar to zip and tar files.
-Open up whirlwind.warc in your favorite text editor. This is the uncompressed
+Open up `whirlwind.warc` in your favorite text editor. This is the uncompressed
 version of the file -- normally we always work with these files while they
 are compressed.
 
@@ -47,13 +47,13 @@ with its http headers; the response from the webserver, with its http
 headers followed by the html; and finally a metadata record related to
 that response.
 
-Now let's look at whirlwind.warc.wet -- which is in WARC format, but
+Now let's look at `whirlwind.warc.wet` -- which is in WARC format, but
 the thing stored in the record is the extracted text from the html.
 There's a warcinfo record at the start, and then just one record
 relating to the webpage. It's a "conversion" record: it does not
 have any http headers, it's just the extracted text.
 
-Finally, open up whirlwind.warc.wat -- also in WARC format. This file
+Finally, open up `whirlwind.warc.wat` -- also in WARC format. This file
 contains a metadata record for each response in the warc. The metadata
 is stored as json. You might want to feed this json into a
 pretty-printer to read it more easily. For example, you can save just
@@ -76,7 +76,7 @@ installed with `brew install awscli`. You'll also need virtualenv,
 ## Set up a virtual environment
 
 It's a good idea to set up completely separate environments for Python
-project, where you can install things without either changing the
+projects, where you can install things without either changing the
 system Python environment, or any of your other Python projects.
 
 If you already have your own favorite virtual environment scheme, you
@@ -182,7 +182,7 @@ cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdx
 cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
 ```
 
-Now look at the .cdxj files with `cat whirlwind*.cdxj`. You'll see
+Now look at the `.cdxj` files with `cat whirlwind*.cdxj`. You'll see
 that each file has one entry in the index. The warc only has the
 response record indexed -- by default cdxj-indexer guesses that you
 won't ever want to random-access the request or metadata. wet and wat
@@ -226,7 +226,7 @@
 ```make extract```
 
 to run a set of extractions from your local
-whirlwind.*.gz files.
+`whirlwind.*.gz` files.
 
 ```
 creating extraction.* from local warcs, the offset numbers are from the cdxj index
@@ -283,7 +283,7 @@ python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 The command lines for these `cdxt` commands specifies the exact URL
 we've been using all along, and the particular date of its
 capture, 20240518015810. The output is a warc file
-TEST-000000.extracted.warc.gz, with this one record plus a warcinfo
+`TEST-000000.extracted.warc.gz`, with this one record plus a warcinfo
 record explaining what this warc is. The Makefile target also runs
 cdxj-indexer on this new warc, and iterates over it.
 
@@ -322,8 +322,8 @@ all of the crawl indexes, because it would be slow. So let's start by
 figuring out which crawl was ongoing on the date 20240518015810, and
 then we'll work with just that one crawl.
 
-To find the crawl name, download the file collinfo.json from
-index.commoncrawl.org. It includes the dates for the the start and end
+To find the crawl name, download the file `collinfo.json` from
+[index.commoncrawl.org](https://index.commoncrawl.org). It includes the dates for the the start and end
 of every crawl.
 
 Run