@@ -453,82 +453,63 @@ On a machine with a 1 gigabit network connection and many cores, this should tak
453453 <summary>Click to view output</summary>
454454
455455```
456- warning! this might take 1-10 minutes
457- python duck.py cloudfront
458- total records for crawl: CC-MAIN-2024-22
459- ┌──────────────┐
460- │ count_star() │
461- │ int64 │
462- ├──────────────┤
463- │ 2709877975 │
464- └──────────────┘
465-
466- our one row
467- ┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
468- │ url_surtkey │ url │ url_host_name │ … │ warc_segment │ crawl │ subset │
469- │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
470- ├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
471- │ org,wikipedia,an)/… │ https://an.wikiped… │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc │
472- ├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
473- │ 1 rows 32 columns (6 shown) │
474- └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
475-
476- writing our one row to a local parquet file, whirlwind.parquet
477- total records for local whirlwind.parquet should be 1
478- ┌──────────────┐
479- │ count_star() │
480- │ int64 │
481- ├──────────────┤
482- │ 1 │
483- └──────────────┘
484-
485- our one row, locally
486- ┌──────────────────────┬──────────────────────┬──────────────────┬───┬──────────────────┬─────────────────┬─────────┐
487- │ url_surtkey │ url │ url_host_name │ … │ warc_segment │ crawl │ subset │
488- │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
489- ├──────────────────────┼──────────────────────┼──────────────────┼───┼──────────────────┼─────────────────┼─────────┤
490- │ org,wikipedia,an)/… │ https://an.wikiped… │ an.wikipedia.org │ … │ 1715971057216.39 │ CC-MAIN-2024-22 │ warc │
491- ├──────────────────────┴──────────────────────┴──────────────────┴───┴──────────────────┴─────────────────┴─────────┤
492- │ 1 rows 32 columns (6 shown) │
493- └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
494-
495- complete row:
456+ Using algorithm: cloudfront
457+ Total records for crawl: CC-MAIN-2024-22
458+ 100% ▕████████████████████████████████████████████████████████████▏
459+ 2709877975
460+
461+ Our one row:
462+ 100% ▕████████████████████████████████████████████████████████████▏
463+ url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
464+ --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
465+ org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
466+
467+ Writing our one row to a local parquet file, whirlwind.parquet
468+ 100% ▕████████████████████████████████████████████████████████████▏
469+ Total records for local whirlwind.parquet should be 1:
470+ 1
471+
472+ Our one row, locally:
473+ url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
474+ --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
475+ org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
476+
477+ Complete row:
496478 url_surtkey org,wikipedia,an)/wiki/escopete
497479 url https://an.wikipedia.org/wiki/Escopete
498480 url_host_name an.wikipedia.org
499481 url_host_tld org
500482 url_host_2nd_last_part wikipedia
501483 url_host_3rd_last_part an
502- url_host_4th_last_part None
503- url_host_5th_last_part None
484+ url_host_4th_last_part null
485+ url_host_5th_last_part null
504486 url_host_registry_suffix org
505487 url_host_registered_domain wikipedia.org
506488 url_host_private_suffix org
507489 url_host_private_domain wikipedia.org
508490 url_host_name_reversed org.wikipedia.an
509491 url_protocol https
510- url_port nan
492+ url_port null
511493 url_path /wiki/Escopete
512- url_query None
513- fetch_time 2024-05-18 01:58:10+00:00
494+ url_query null
495+ fetch_time 2024-05-18T01:58:10Z
514496 fetch_status 200
515- fetch_redirect None
497+ fetch_redirect null
516498 content_digest RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
517499 content_mime_type text/html
518500 content_mime_detected text/html
519501 content_charset UTF-8
520502 content_languages spa
521- content_truncated None
503+ content_truncated null
522504 warc_filename crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz
523505 warc_record_offset 80610731
524506 warc_record_length 17423
525507 warc_segment 1715971057216.39
526508 crawl CC-MAIN-2024-22
527509 subset warc
528510
529- equivalent to cdxj:
530- org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17423", "offset": "80610731", "filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
531-
511+ Equivalent to CDXJ:
512+ org,wikipedia,an)/wiki/escopete 20240518015810 {"url":"https://an.wikipedia.org/wiki/Escopete","mime":"text/html","status":"200","digest":"sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU","length":"17423","offset":"80610731","filename":"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
532513```
533514</details>
534515
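Note on the "Equivalent to CDXJ" line in the output above: the mapping from the columnar row to the CDXJ record can be sketched as below. This is an illustration inferred only from the output shown in this diff, not the code in `duck.py`. The function name `row_to_cdxj` is hypothetical, and the field mapping (for example `content_mime_type` to `mime`, and the `sha1:` digest prefix) is an assumption read off the two printed records.

```python
import json
from datetime import datetime


def row_to_cdxj(row: dict) -> str:
    """Build a CDXJ line from one columnar-index row (hypothetical helper).

    Assumes fetch_time is an ISO 8601 UTC string such as 2024-05-18T01:58:10Z.
    """
    # CDXJ uses a 14-digit timestamp: YYYYMMDDhhmmss
    ts = datetime.strptime(row["fetch_time"], "%Y-%m-%dT%H:%M:%SZ").strftime(
        "%Y%m%d%H%M%S"
    )
    # In the printed record every JSON value is a string, including numbers
    payload = {
        "url": row["url"],
        "mime": row["content_mime_type"],
        "status": str(row["fetch_status"]),
        "digest": "sha1:" + row["content_digest"],
        "length": str(row["warc_record_length"]),
        "offset": str(row["warc_record_offset"]),
        "filename": row["warc_filename"],
    }
    # Compact separators match the new output, which has no spaces in the JSON
    return f'{row["url_surtkey"]} {ts} {json.dumps(payload, separators=(",", ":"))}'


# Values copied from the "Complete row" output in this diff
example_row = {
    "url_surtkey": "org,wikipedia,an)/wiki/escopete",
    "url": "https://an.wikipedia.org/wiki/Escopete",
    "fetch_time": "2024-05-18T01:58:10Z",
    "fetch_status": 200,
    "content_digest": "RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU",
    "content_mime_type": "text/html",
    "warc_filename": "crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz",
    "warc_record_offset": 80610731,
    "warc_record_length": 17423,
}

print(row_to_cdxj(example_row))
```

Only nine of the 32 columns are needed for the CDXJ record; the remaining columns (host parts, detected MIME type, languages, and so on) exist only in the columnar index.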