Alexa vs Domcop vs Majestic - Top Million Sites
Introduction
Alexa1, Domcop2(based on CommonCrawl3 data) Majestic4 & provide top 1 million popular websites based on their analytics. In this article we will download this data and compare them using Linux command line tools.
Collecting data
Let's download data from above sources and extract domain names. The data format is different for each source. We can use awk
tool to extract domains column from the source. After extracting data, sort it and save it to a file.
Extracting domains from alexa.
# alexa $ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip $ unzip top-1m.csv.zip # data sorted by ranking $ head -n 5 top-1m.csv 1,google.com 2,youtube.com 3,facebook.com 4,baidu.com 5,wikipedia.org $ awk -F "," '{print $2}' top-1m.csv | sort > alexa # domains after sorting alphabetically $ head -n 5 alexa 00000.life 00-000.pl 00004.tel 00008888.tumblr.com 0002rick.tumblr.com
Extracting domain names from domcop.
# Domcop $ wget https://www.domcop.com/files/top/top10milliondomains.csv.zip $ unzip top10milliondomains.csv.zip # data sorted by ranking $ head -n 5 top10milliondomains.csv "Rank","Domain","Open Page Rank" "1","fonts.googleapis.com","10.00" "2","facebook.com","10.00" "3","youtube.com","10.00" "4","twitter.com","10.00" $ awk -F "\"*,\"*" '{if(NR>1)print $2}' top10milliondomains.csv.zip | sort > domcop # domains after sorting alphabetically $ head -n 5 domcop 00000000b.com 000000book.com 0000180.fortunecity.com 000139418.wixsite.com 000fashions.blogspot.com
Extracting domain names from majestic.
# Majestic $ wget http://downloads.majestic.com/majestic_million.csv # data sorted by ranking $ head -n 5 majestic_million.csv GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs 1,1,google.com,com,474277,3016409,google.com,com,1,1,474577,3012875 2,2,facebook.com,com,462854,3093315,facebook.com,com,2,2,462860,3090006 3,3,youtube.com,com,422434,2504924,youtube.com,com,3,3,422377,2501555 4,4,twitter.com,com,412950,2497935,twitter.com,com,4,4,413220,2495261 $ awk -F "\"*,\"*" '{if(NR>1)print $2}' majestic_million.csv | sort > majestic # domains after sorting alphabetically $ head -n 5 majestic 00000.xn--p1ai 0000666.com 0000.jp 0000www.com 0000.xn--p1ai
Comparing Data
We have collected and extracted domains from above sources. Let's compare the domains to see how similar they are using comm
tool.
$ comm -123 alexa domcop --total 871851 871851 128149 total $ comm -123 alexa majestic --total 788454 788454 211546 total $ comm -123 domcop majestic --total 784388 784388 215612 total
$ comm -12 alexa domcop | comm -123 - majestic --total 31314 903165 96835 total
So, only 96,835(9.6%) domains are common between all the datasets and the overlap between any two sources is ~20%. Here is a venn diagram showing the overlap between them.
Conclusion
We have collected data from alexa, domcorp & majestic, extracted domains from it and observed that there is only a small overlap between them.