Common Crawl on Laptop - Building Web Directory

This series of posts discusses processing the Common Crawl dataset on a laptop.

  1. Extracting Subset of Common Crawl
  2. Building web directory (this post)

Introduction

In the earlier post, we extracted all Telugu web page URLs into a CSV file. In this post, let's explore these URLs and build a web directory from them.

Explore Data

Let's see how many URLs are present in the extracted subset of data.

$ wc -l telugu.csv
  852025 telugu.csv 

In the earlier post, we installed duckdb and used it to process parquet files. duckdb can also execute SQL queries directly on CSV files. Let's use it to explore the data stored in telugu.csv.

Let's see how many unique domains are present in the data.

$ duckdb -c """
    SELECT COUNT(DISTINCT url_host_name_reversed) as unique_sites
    FROM read_csv('telugu.csv', auto_detect = TRUE);
"""
┌──────────────┐
│ unique_sites │
├──────────────┤
│ 13632        │
└──────────────┘

There are ~14k unique domains. Let's see the page density across these domains.

$ duckdb -c """
    SELECT count AS page_count,
           COUNT(*) AS sites
    FROM (SELECT url_host_name_reversed, COUNT(*) AS count
          FROM read_csv('telugu.csv', auto_detect = TRUE)
          GROUP BY url_host_name_reversed) AS t
    GROUP BY page_count
    ORDER BY page_count;
"""
┌────────────┬───────┐
│ page_count │ sites │
├────────────┼───────┤
│ 1          │ 6326  │
│ 2          │ 1904  │
│ 3          │ 733   │
│ 4          │ 459   │
│ 5          │ 315   │
...

About ~75% of the sites have fewer than 5 pages. It is highly unlikely that these sites' complete content is in Telugu. After manually checking a few of them, I found a lot of false positives.

In the earlier post, we extracted all pages that contain any Telugu content. Let's now keep only the pages where Telugu is the primary language.

$ duckdb -c """
  COPY (
    SELECT * FROM read_csv('telugu.csv', auto_detect=true)
    WHERE content_languages like 'tel%'
  ) TO 'te_primary.csv' (DELIMITER ',', HEADER TRUE);
"""
$ wc -l te_primary.csv
  573130 te_primary.csv
$ duckdb -c "SELECT COUNT(DISTINCT url_host_name_reversed) as unique_sites FROM read_csv('te_primary.csv', auto_detect = TRUE)"                           
┌──────────────┐
│ unique_sites │
├──────────────┤
│ 5666         │
└──────────────┘    

Let's see how page density per domain has changed.

$ duckdb -c """
    SELECT count AS page_count,
           COUNT(*) AS sites
    FROM (SELECT url_host_name_reversed, COUNT(*) AS count
          FROM read_csv('te_primary.csv', auto_detect = TRUE)
          GROUP BY url_host_name_reversed) AS t
    GROUP BY page_count
    ORDER BY page_count;
"""
┌────────────┬───────┐
│ page_count │ sites │
├────────────┼───────┤
│ 1          │ 2183  │
│ 2          │ 843   │
│ 3          │ 235   │
│ 4          │ 146   │
│ 5          │ 98    │
...

Page density remains almost the same.

Let's keep only the sites which have at least 5 pages in Telugu. This will eliminate a lot of false positives; a sketch of this filter is shown below, followed by the most popular sites from the results.
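
Here is a minimal duckdb sketch of this filter; the output file name te_sites_5plus.csv is my own choice, not part of the original pipeline:

$ duckdb -c """
  COPY (
    SELECT url_host_name_reversed, COUNT(*) AS pages
    FROM read_csv('te_primary.csv', auto_detect = TRUE)
    GROUP BY url_host_name_reversed
    HAVING COUNT(*) >= 5
  ) TO 'te_sites_5plus.csv' (DELIMITER ',', HEADER TRUE);
"""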

   1    Rank,Domain,Open Page Rank
   2    25,support.google.com,8.55
   3    57,t.me,7.76
   4    76,chrome.google.com,7.49
   5    163,support.mozilla.org,6.99
   6    170,groups.google.com,6.94

A lot of unrelated domains are present here because they, too, can have 10+ pages in Telugu. But we don't need these.

Let's look only at domains whose home page (or translated home page, such as /te/) has Telugu as the primary content language.

$ duckdb -c """
  SELECT COUNT(distinct url) 
  FROM read_csv('te_primary.csv', auto_detect=true) 
  WHERE (url_path = '/' or url_path = '/te/') and url_query is null;
"""

Now the domain count has come down to ~6k. Let's export these domains to a CSV file, as sketched below.
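
A sketch of this export with duckdb, reusing the home-page condition from the query above and writing the reversed host names to domains.csv (the file the script below reads); keeping only this one column is my assumption:

$ duckdb -c """
  COPY (
    SELECT DISTINCT url_host_name_reversed
    FROM read_csv('te_primary.csv', auto_detect = TRUE)
    WHERE (url_path = '/' OR url_path = '/te/') AND url_query IS NULL
  ) TO 'domains.csv' (DELIMITER ',', HEADER FALSE);
"""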

Common Crawl doesn't yet provide any kind of categorization for these domains. For now, we can use Open PageRank to sort them by rank.

We can download the top 10 million domains from Open PageRank3. Here is a simple Python script to extract the Telugu domains from that list.

import pandas as pd

# reversed host names exported from duckdb (e.g. "com.example.te")
domains_file = 'domains.csv'
with open(domains_file, 'r') as f:
    telugu_domains = [line.strip() for line in f.readlines()]

# convert reversed host names back to normal order (e.g. "te.example.com")
telugu_domains = ['.'.join(reversed(domain.split('.'))) for domain in telugu_domains]

# keep only the Open PageRank rows whose domain is in our Telugu list
df = pd.read_csv('t10m.csv')
df = df[df['Domain'].isin(telugu_domains)]

df.to_csv('t10m_telugu.csv', index=False)

Now we have a list of all Telugu domains sorted by rank. In the next post, we will use this list to categorize the domains.

Common Crawl On Laptop - Extracting Subset Of Data

This series of posts discusses processing the Common Crawl dataset on a laptop.

  1. Extracting Subset of Common Crawl (this post)
  2. Building web directory

Introduction

Common Crawl (CC)1 is an open repository of web crawl data containing petabytes of data collected since 2008. As the dataset is huge, most tutorials use AWS EMR/Athena to process it.

In this post, let's learn how to extract a subset of the data (all Telugu language web pages) and process it on our local machine.

Exploring Common Crawl

CC provides monthly data dumps in WARC format. Each crawl consists of ~3 billion web pages with a compressed size of ~100 TB.

In addition to WARC files, CC provides index files as well as columnar index2 files so that users can easily search, filter and download the data.

Common Crawl Index

Each crawl's index is spread over 300 files totalling ~250 GB. For this post, let's use the latest crawl, which is CC-MAIN-2022-40.

The index files can be accessed from AWS S3 or over HTTPS. We can use the aws cli to list all the files along with their sizes.

$ aws s3 ls --recursive --human-readable --summarize s3://commoncrawl/cc-index/collections/CC-MAIN-2022-40
2022-10-08 16:07:59  621.9 MiB cc-index/collections/CC-MAIN-2022-40/indexes/cdx-00000.gz
2022-10-08 16:08:26  721.6 MiB cc-index/collections/CC-MAIN-2022-40/indexes/cdx-00001.gz
...
2022-10-08 16:42:39  146.6 MiB cc-index/collections/CC-MAIN-2022-40/indexes/cluster.idx
2022-10-08 16:42:33   30 Bytes cc-index/collections/CC-MAIN-2022-40/metadata.yaml

Total Objects: 302
   Total Size: 236.1 GiB

Let's download an index file to our local machine and see how the data is arranged. We can use the aws cli to download it from the S3 bucket, or wget to download it from the HTTPS endpoint.

# from s3
$ aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2022-40/indexes/cdx-00000.gz .

# from https
$ wget https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2022-40/indexes/cdx-00000.gz

Let's print the top five lines of the file.

$ zcat < cdx-00000.gz | head -n 5
0,1,184,137)/1klikbet 20221005193707 {"url": "http://137.184.1.0/1klikbet/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "XTKGORHKLZCHDBBOMYCYYIZVRPMXNRII", "length": "7065", "offset": "83437", "filename": "crawl-data/CC-MAIN-2022-40/segments/1664030337663.75/warc/CC-MAIN-20221005172112-20221005202112-00011.warc.gz", "charset": "UTF-8", "languages": "ind"}
0,1,184,137)/7meter 20221005192131 {"url": "http://137.184.1.0/7meter/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "KUJAMRT6MXYR3RTWRJTIWJ5T2ZUB3EBH", "length": "7456", "offset": "142680", "filename": "crawl-data/CC-MAIN-2022-40/segments/1664030337663.75/warc/CC-MAIN-20221005172112-20221005202112-00182.warc.gz", "charset": "UTF-8", "languages": "ind"}
...

The JSON at the end of each line contains the language information in a languages field. Using these index files, we can extract all the lines containing the tel language code, for example as sketched below.
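
A rough way to pull those lines out of a single index file with grep (the output file name here is my own):

$ zcat < cdx-00000.gz | grep '"languages": "[^"]*tel' > cdx-00000-telugu.txt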

Columnar Index

We can also use the columnar index to select Telugu language web pages. Let's download a single file from the index.

# from s3
$ aws s3 cp s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/part-00000-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet .

# from https
$ wget https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/part-00000-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet

We can use Python pandas to read the parquet file and select the Telugu language web pages. The columnar index has a content_languages column which can be used to filter Telugu pages, as shown below.

$ python -c """
import pandas as pd
filename = 'part-00000-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet'
df = pd.read_parquet(filename)
df = df[df['content_languages'].str.startswith('tel', na=False)]
df.to_csv('telugu.csv')
"""

I used a MacBook M1 with a local ISP (Internet Service Provider) connection to download and extract the index. It took around 7 minutes to download a single file and 2 minutes to extract the data. At ~9 minutes per file, processing all 300 index files would take ~2 days.

Let's see how we can speed it up.

Improving Performance

Faster Downloads

My Wi-Fi speed is ~4 MBps when downloading the index files. To download faster, I created a t2.micro (free-tier) EC2 instance on AWS, where the download speed is ~10 MBps. We could use bigger instances, but I am trying to use only free resources. On this machine, a single file download takes ~3 minutes.

The CC dataset is hosted in the us-east-1 region. So, I created a new t2.micro instance in us-east-1 as well. This instance takes <20 seconds to download a single file, so we can download the entire index in less than 2 hours. Launching such an instance with the aws cli looks roughly like the sketch below.
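
This is only a reference sketch; the AMI id and key pair name are placeholders for your own values:

# placeholder AMI id and key pair name; substitute your own
$ aws ec2 run-instances --region us-east-1 --instance-type t2.micro \
    --image-id ami-xxxxxxxxxxxxxxxxx --key-name my-key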

Faster Performance

To extract data from the index files, we used Python pandas without specifying an engine. By default, it uses pyarrow, which is a bit slow. To improve speed, we can use fastparquet as the engine, which is ~5x faster than pyarrow.

import pandas as pd

filename = 'part-00000-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet'
df = pd.read_parquet(filename, engine='fastparquet')

To get even better performance, we can use duckdb. Duckdb is an in-process SQL OLAP DBMS, and it can execute SQL queries directly on parquet files using the parquet extension.

$ brew install duckdb

$ duckdb -c 'INSTALL parquet;'

We can write a simple SQL query to filter out the required rows.

$ duckdb -c """
LOAD parquet;
COPY (select * from PARQUET_SCAN('part-00000-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet') where content_languages ilike '%tel%') TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""

Duckdb can execute SQL queries on remote files as well, using the httpfs extension.

$ duckdb -c 'INSTALL httpfs;'

$ duckdb -c """
    LOAD httpfs;
    LOAD parquet;

    COPY (select * from PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/part-00001-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet') where content_languages ilike '%tel%') TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""

Duckdb can also read a series of parquet files and treat them as a single table. We can use this feature to process all the index files with a single command.

$ duckdb -c """
    LOAD httpfs;
    LOAD parquet;

    SET s3_region='us-east-1';
    SET s3_access_key_id='your_access_key_id';
    SET s3_secret_access_key='your_secret_access_key';

    COPY (select * from PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet') where content_languages ilike '%tel%') TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""

Depending on the file size, duckdb takes 10-15 seconds to process a single file. Since we don't need all the columns for further processing, we can limit the query to the 5 required columns.

$ duckdb -c """
    COPY (select url, content_languages, warc_filename, warc_record_offset, warc_record_length from PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet') where content_languages ilike '%tel%') TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""

By limiting the columns3, there is another 65% improvement in performance. Now duckdb can process a file in 3 to 8 seconds depending on its size, and we can process the entire index in ~20 minutes.

Conclusion

With a single command, we can extract a subset of the CC index in ~2 hours. So far we have processed all the files in a single process. We can also parallelize the work using GNU parallel to get faster results, as sketched below.
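
A rough sketch of one way to do this, assuming a small wrapper script (extract_part.sh, my own naming) that runs the earlier duckdb COPY for a single part file:

#!/bin/sh
# extract_part.sh (hypothetical): $1 is the 3-digit part suffix, e.g. 042
duckdb -c "
  COPY (
    SELECT url, content_languages, warc_filename, warc_record_offset, warc_record_length
    FROM PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/part-00${1}-26160df0-1827-4787-a515-95ecaa2c9688.c000.gz.parquet')
    WHERE content_languages ILIKE '%tel%'
  ) TO 'telugu-${1}.csv' (DELIMITER ',', HEADER FALSE);
"

# run 4 extractions at a time, then stitch the per-part files together
$ seq -w 0 299 | parallel -j 4 ./extract_part.sh {}
$ echo "url,content_languages,warc_filename,warc_record_offset,warc_record_length" > telugu.csv
$ cat telugu-*.csv >> telugu.csv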

In the upcoming posts, let's see how we can fetch the data from WARC files using this index and do further data processing.