Commit e55c48a

feat: Task 8, duck DB with local file
1 parent b3c7252 commit e55c48a

5 files changed

Lines changed: 376 additions & 58 deletions

CC-MAIN-2024-22.warc.paths.gz

817 Bytes
Binary file not shown.

Makefile

Lines changed: 9 additions & 8 deletions
@@ -44,14 +44,15 @@ CC-MAIN-2024-22.warc.paths.gz:
 #	@echo "warning! 300 gigabyte download"
 #	python duck.py local_files
 #
-# duck_ccf_local_files:
-#	@echo "warning! only works on Common Crawl Foundation's development machine"
-#	python duck.py ccf_local_files
-#
-# duck_cloudfront:
-#	@echo "warning! this might take 1-10 minutes"
-#	python duck.py cloudfront
-#
+duck_ccf_local_files: build
+	@echo "warning! only works on Common Crawl Foundation's development machine"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"
+
+duck_cloudfront: build
+	@echo "warning! this might take 1-10 minutes"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="cloudfront"
+
+
 ensure_jwarc:
 	@echo "Ensuring JWarc JAR is present"
 	@if [ ! -f jwarc.jar ] ; then \
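
The two targets above hand an algorithm name to the Java entry point through the `exec.args` property of `exec:java`. The real `org.commoncrawl.whirlwind.Duck` class is not part of this diff, so the following is only a minimal sketch of how such an argument might be validated. The class name `DuckArgsSketch` and its methods are hypothetical; the algorithm names come from the Makefile targets and the banner text from the README output.

```java
import java.util.List;

// Hypothetical sketch: not the Duck class from this commit.
final class DuckArgsSketch {
    // Names taken from the Makefile targets (local_files is still commented out there).
    static final List<String> KNOWN = List.of("local_files", "ccf_local_files", "cloudfront");

    // Returns the single algorithm name passed on the command line, or fails loudly.
    static String selectAlgorithm(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("expected exactly one argument, one of " + KNOWN);
        }
        String algorithm = args[0];
        if (!KNOWN.contains(algorithm)) {
            throw new IllegalArgumentException("unknown algorithm: " + algorithm);
        }
        return algorithm;
    }

    // Mirrors the first line of the README output.
    static String banner(String algorithm) {
        return "Using algorithm: " + algorithm;
    }
}
```

With Maven, the value reaches `args` only when the property is written as `-Dexec.args="cloudfront"`, with the `=` sign; without it, Maven defines a differently named property and the program receives no arguments.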

README.md

Lines changed: 31 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -453,82 +453,63 @@ On a machine with a 1 gigabit network connection and many cores, this should tak
453453
<summary>Click to view output</summary>
454454

455455
```
456-
warning! this might take 1-10 minutes
457-
python duck.py cloudfront
458-
total records for crawl: CC-MAIN-2024-22
459-
┌──────────────┐
460-
│ count_star() │
461-
│ int64 │
462-
├──────────────┤
463-
│ 2709877975 │
464-
└──────────────┘
465-
466-
our one row
467-
┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
468-
│ url_surtkey │ url │ url_host_name │ … │ warc_segment │ crawl │ subset │
469-
│ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
470-
├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
471-
│ org,wikipedia,an)/… │ https://an.wikiped… │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc │
472-
├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
473-
│ 1 rows 32 columns (6 shown) │
474-
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
475-
476-
writing our one row to a local parquet file, whirlwind.parquet
477-
total records for local whirlwind.parquet should be 1
478-
┌──────────────┐
479-
│ count_star() │
480-
│ int64 │
481-
├──────────────┤
482-
│ 1 │
483-
└──────────────┘
484-
485-
our one row, locally
486-
┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
487-
│ url_surtkey │ url │ url_host_name │ … │ warc_segment │ crawl │ subset │
488-
│ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
489-
├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
490-
│ org,wikipedia,an)/… │ https://an.wikiped… │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc │
491-
├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
492-
│ 1 rows 32 columns (6 shown) │
493-
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
494-
495-
complete row:
456+
Using algorithm: cloudfront
457+
Total records for crawl: CC-MAIN-2024-22
458+
100% ▕████████████████████████████████████████████████████████████▏
459+
2709877975
460+
461+
Our one row:
462+
100% ▕████████████████████████████████████████████████████████████▏
463+
url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
464+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
465+
org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
466+
467+
Writing our one row to a local parquet file, whirlwind.parquet
468+
100% ▕████████████████████████████████████████████████████████████▏
469+
Total records for local whirlwind.parquet should be 1:
470+
1
471+
472+
Our one row, locally:
473+
url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
474+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
475+
org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
476+
477+
Complete row:
496478
url_surtkey org,wikipedia,an)/wiki/escopete
497479
url https://an.wikipedia.org/wiki/Escopete
498480
url_host_name an.wikipedia.org
499481
url_host_tld org
500482
url_host_2nd_last_part wikipedia
501483
url_host_3rd_last_part an
502-
url_host_4th_last_part None
503-
url_host_5th_last_part None
484+
url_host_4th_last_part null
485+
url_host_5th_last_part null
504486
url_host_registry_suffix org
505487
url_host_registered_domain wikipedia.org
506488
url_host_private_suffix org
507489
url_host_private_domain wikipedia.org
508490
url_host_name_reversed org.wikipedia.an
509491
url_protocol https
510-
url_port nan
492+
url_port null
511493
url_path /wiki/Escopete
512-
url_query None
513-
fetch_time 2024-05-18 01:58:10+00:00
494+
url_query null
495+
fetch_time 2024-05-18T01:58:10Z
514496
fetch_status 200
515-
fetch_redirect None
497+
fetch_redirect null
516498
content_digest RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
517499
content_mime_type text/html
518500
content_mime_detected text/html
519501
content_charset UTF-8
520502
content_languages spa
521-
content_truncated None
503+
content_truncated null
522504
warc_filename crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz
523505
warc_record_offset 80610731
524506
warc_record_length 17423
525507
warc_segment 1715971057216.39
526508
crawl CC-MAIN-2024-22
527509
subset warc
528510
529-
equivalent to cdxj:
530-
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "80610731", "filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
531-
511+
Equivalent to CDXJ:
512+
org,wikipedia,an)/wiki/escopete 20240518015810 {"url":"https://an.wikipedia.org/wiki/Escopete","mime":"text/html","status":"200","digest":"sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU","length":"17423","offset":"80610731","filename":"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
532513
```
533514
</details>
534515
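
The last lines of the README output pair a columnar index row with its "Equivalent to CDXJ" form. As a rough illustration of that mapping, the sketch below rebuilds the CDXJ line from the column values shown in the output. The column-to-field mapping is inferred from the output alone. The class `CdxjSketch` and its hand-rolled JSON quoting are assumptions made to keep the example dependency-free; the commit adds Gson, which may well be what the real code uses.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: mapping inferred from the README output, not taken from the commit's code.
final class CdxjSketch {
    // CDXJ timestamps are 14-digit UTC values such as 20240518015810.
    private static final DateTimeFormatter CDX_TIME =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    // Builds "<surt key> <timestamp> <json>" from a row keyed by column name.
    static String toCdxj(Map<String, String> row) {
        String timestamp = CDX_TIME.format(Instant.parse(row.get("fetch_time")));
        Map<String, String> fields = new LinkedHashMap<>(); // keeps the field order of the output
        fields.put("url", row.get("url"));
        fields.put("mime", row.get("content_mime_type"));
        fields.put("status", row.get("fetch_status"));
        fields.put("digest", "sha1:" + row.get("content_digest"));
        fields.put("length", row.get("warc_record_length"));
        fields.put("offset", row.get("warc_record_offset"));
        fields.put("filename", row.get("warc_filename"));
        StringJoiner json = new StringJoiner(",", "{", "}");
        fields.forEach((key, value) -> json.add(quote(key) + ":" + quote(value)));
        return row.get("url_surtkey") + " " + timestamp + " " + json;
    }

    // Minimal JSON string quoting (backslash and double quote only); a simplification.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // The row from the README output, reduced to the columns the mapping needs.
    static Map<String, String> sampleRow() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("url_surtkey", "org,wikipedia,an)/wiki/escopete");
        row.put("url", "https://an.wikipedia.org/wiki/Escopete");
        row.put("fetch_time", "2024-05-18T01:58:10Z");
        row.put("fetch_status", "200");
        row.put("content_digest", "RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU");
        row.put("content_mime_type", "text/html");
        row.put("warc_filename", "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz");
        row.put("warc_record_offset", "80610731");
        row.put("warc_record_length", "17423");
        return row;
    }
}
```

Calling `CdxjSketch.toCdxj(CdxjSketch.sampleRow())` yields the compact line printed under "Equivalent to CDXJ" in the new output. The quoting helper does not handle control characters or non-ASCII escapes, so a JSON library is the safer choice for arbitrary URLs.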

pom.xml

Lines changed: 10 additions & 0 deletions
@@ -26,6 +26,16 @@
             <artifactId>jwarc</artifactId>
             <version>0.33.0</version>
         </dependency>
+        <dependency>
+            <groupId>org.duckdb</groupId>
+            <artifactId>duckdb_jdbc</artifactId>
+            <version>1.1.3</version>
+        </dependency>
+        <dependency>
+            <groupId>com.google.code.gson</groupId>
+            <artifactId>gson</artifactId>
+            <version>2.11.0</version>
+        </dependency>
     </dependencies>
 
     <build>
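
The `duckdb_jdbc` dependency added above exposes DuckDB through the standard `java.sql` API. As a hedged illustration of the "Total records for local whirlwind.parquet" step in the README output, the sketch below counts the rows of a Parquet file over an in-memory DuckDB connection (`jdbc:duckdb:`) using the `read_parquet` table function. The class `DuckCountSketch` and its method names are hypothetical, and whether the commit's `Duck` class issues its queries this way is not visible in the diff.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: requires the duckdb_jdbc JAR on the classpath to actually run a query.
final class DuckCountSketch {
    // Builds the count query; single quotes in the path are doubled for the SQL string literal.
    static String countQuery(String parquetPath) {
        return "SELECT COUNT(*) FROM read_parquet('" + parquetPath.replace("'", "''") + "')";
    }

    // Opens an in-memory DuckDB database and returns the number of rows in the file.
    static long countRows(String parquetPath) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(parquetPath))) {
            rs.next(); // COUNT(*) always returns exactly one row
            return rs.getLong(1);
        }
    }
}
```

With the driver available, `DuckCountSketch.countRows("whirlwind.parquet")` would be expected to return 1 for the file written in the README walkthrough.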
