Every language on the web has strong links to English, usually with around twenty percent of offsite links and occasionally over forty five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan . Both the Philippines and Pakistan are former British colonies where English is one of the two official languages.
Examining links between other languages, it seems that many are explained by people and communities which speak both languages.
The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.
The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.
Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.
You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of sites in our set are English, while it attracts 79 percent of all out-language links from other languages.
Chinese and Japanese also seem unusual because there are relatively few links from pages in these languages to pages in English. This is despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its `introversion’, or fraction of off-site links to pages in the same language. Taking this into account shows that Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content.
There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted.
We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked . We see, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
What’s happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like the Polyglots Team translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.