Commit da9c01e

doc: improve readability of readme (#6)
1 parent a7b2837 commit da9c01e

1 file changed

Lines changed: 9 additions & 9 deletions

README.md

@@ -33,7 +33,7 @@ Ready? Here we go!
 ## Look at the example files in an editor
 
 WARC files are a container that holds files, similar to zip and tar files.
-Open up whirlwind.warc in your favorite text editor. This is the uncompressed
+Open up `whirlwind.warc` in your favorite text editor. This is the uncompressed
 version of the file -- normally we always work with these files while they
 are compressed.
 
@@ -47,13 +47,13 @@ with its http headers; the response from the webserver, with its http
 headers followed by the html; and finally a metadata record related to
 that response.
 
-Now let's look at whirlwind.warc.wet -- which is in WARC format, but
+Now let's look at `whirlwind.warc.wet` -- which is in WARC format, but
 the thing stored in the record is the extracted text from the html.
 There's a warcinfo record at the start, and then just one record
 relating to the webpage. It's a "conversion" record: it does not
 have any http headers, it's just the extracted text.
 
-Finally, open up whirlwind.warc.wat -- also in WARC format. This file
+Finally, open up `whirlwind.warc.wat` -- also in WARC format. This file
 contains a metadata record for each response in the warc. The metadata
 is stored as json. You might want to feed this json into a
 pretty-printer to read it more easily. For example, you can save just
@@ -76,7 +76,7 @@ installed with `brew install awscli`. You'll also need virtualenv,
 ## Set up a virtual environment
 
 It's a good idea to set up completely separate environments for Python
-project, where you can install things without either changing the
+projects, where you can install things without either changing the
 system Python environment, or any of your other Python projects.
 
 If you already have your own favorite virtual environment scheme, you
@@ -182,7 +182,7 @@ cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdx
 cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
 ```
 
-Now look at the .cdxj files with `cat whirlwind*.cdxj`. You'll see
+Now look at the `.cdxj` files with `cat whirlwind*.cdxj`. You'll see
 that each file has one entry in the index. The warc only has the
 response record indexed -- by default cdxj-indexer guesses that you
 won't ever want to random-access the request or metadata. wet and wat
@@ -226,7 +226,7 @@ Run:
 ```make extract```
 
 to run a set of extractions from your local
-whirlwind.*.gz files.
+`whirlwind.*.gz` files.
 
 ```
 creating extraction.* from local warcs, the offset numbers are from the cdxj index
@@ -283,7 +283,7 @@ python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 The command lines for these `cdxt` commands specifies the exact URL
 we've been using all along, and the particular date of its
 capture, 20240518015810. The output is a warc file
-TEST-000000.extracted.warc.gz, with this one record plus a warcinfo
+`TEST-000000.extracted.warc.gz`, with this one record plus a warcinfo
 record explaining what this warc is. The Makefile target also runs
 cdxj-indexer on this new warc, and iterates over it.
 
@@ -322,8 +322,8 @@ all of the crawl indexes, because it would be slow. So let's start by
 figuring out which crawl was ongoing on the date 20240518015810, and
 then we'll work with just that one crawl.
 
-To find the crawl name, download the file collinfo.json from
-index.commoncrawl.org. It includes the dates for the the start and end
+To find the crawl name, download the file `collinfo.json` from
+[index.commoncrawl.org](https://index.commoncrawl.org). It includes the dates for the the start and end
 of every crawl.
 
 Run
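
Note on the last hunk: the README text describes finding the crawl that was ongoing on 20240518015810 by reading the start and end dates in `collinfo.json`. A minimal Python sketch of that lookup follows. It is not part of this commit or of the README. The sample entries, the crawl ids, and the `from`/`to` field names are assumptions made for illustration, so check them against the real file before relying on this.

```python
import json
from datetime import datetime

# Hypothetical excerpt shaped like collinfo.json. The ids, names, and dates
# are made up, and the "from"/"to" field names are an assumption.
SAMPLE_COLLINFO = """
[
  {"id": "EXAMPLE-CRAWL-B", "name": "Example crawl B",
   "from": "2024-06-01T00:00:00", "to": "2024-06-20T00:00:00"},
  {"id": "EXAMPLE-CRAWL-A", "name": "Example crawl A",
   "from": "2024-05-10T00:00:00", "to": "2024-05-30T00:00:00"}
]
"""


def parse_capture_timestamp(ts):
    """Parse a 14-digit capture timestamp such as 20240518015810."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def find_crawl(collinfo, capture_ts):
    """Return the id of the first crawl whose date range contains capture_ts."""
    when = parse_capture_timestamp(capture_ts)
    for crawl in collinfo:
        start = datetime.fromisoformat(crawl["from"])
        end = datetime.fromisoformat(crawl["to"])
        if start <= when <= end:
            return crawl["id"]
    return None  # no crawl covers that date


crawls = json.loads(SAMPLE_COLLINFO)
print(find_crawl(crawls, "20240518015810"))
```

With a downloaded copy of the real file, `json.load(open("collinfo.json"))` would replace the inline sample, assuming the same field names.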
