The End Of The Language Barrier
Brian McConnell (Twitter), Founder, Worldwide Lexicon Project
NOTE: if you would like to view, create or edit translations for this essay, download our free Firefox Translator. With it, you can view and edit translations for any website in over 50 languages. If you would like to donate to support ongoing development, consider joining our fundraising campaign.
If you are a developer, and would like to read about this in more detail, read this white paper.
While the Internet and the worldwide web have grown to become global in reach, rendering time and distance moot, they are still fragmented by language. The web is not really a single entity, but rather many networks, each relatively isolated from the others. Information now travels quite freely within a language, with few remaining economic or editorial controls over who can publish, and what people can read. Information does not travel so freely between languages. Important news stories and especially interesting pages may be picked up for translation, but overall, the flow of information across languages is inefficient, slow, and unpredictable. We need a multilingual web. What might this look like?
For ten years, I have worked on the Worldwide Lexicon, an open source project to create translation platforms. In recent months, we have made significant progress, and have begun testing tools that offer a glimpse of what the multilingual web could be. Like most first generation tools, ours are crude and can be much improved, but they are a working example of what to aim for.
From a product or system design standpoint, what should we want from a multilingual web? From a user's standpoint, this is easy. A user should be able to open a page, and if it is in a foreign language for them, the web server or their browser should make a best effort attempt to translate it using the best available human and machine translations without intervention. You can see an example of this with our Firefox Translator. It does this using a combination of human edited translations submitted by other users, along with machine translations from several sources. This tool is an addon, and performance issues aside, is a good example of what a multilingual web browser would do. When this capability is fully developed, it should become invisible to most users. When it is embedded in the web browser, every user will have this capability. Translation will become an ambient, automatic service.
The tricky aspect of this is that human language is impossible for computers to comprehend in any meaningful way. There is no single tool or algorithm that can be used to translate human language. Machine translation is good for quickly generating approximate translations, especially where grammar or style don't matter so much. Humans are much better at understanding context, and when they are translating into their native language, are the best at capturing the style and feel of a language. Barring a radical advance in artificial intelligence, this is likely to remain the case for a long time. To build a multilingual web, we need to use a variety of tools, including machine translation systems (there are several major types), translation memories (which store large volumes of human edited translations), and other tools such as dictionaries. In combination, you can build a pretty decent system that draws from different sources where they perform best.
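To make "drawing from different sources where they perform best" a little more concrete, here is a minimal sketch in Python of one possible selection rule: prefer human translations over machine output, and break ties using user scores. The names `Candidate`, `SOURCE_RANK` and `pick_best`, and the ranking itself, are assumptions made for illustration; they are not taken from the WWL code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical trust ranking (higher is more trusted). This ordering is an
# assumption for illustration, not a rule taken from WWL.
SOURCE_RANK = {"professional": 3, "volunteer": 2, "machine": 1}


@dataclass
class Candidate:
    text: str           # the translated text
    source: str         # "professional", "volunteer" or "machine"
    score: float = 0.0  # average user score; 0.0 if nobody has rated it


def pick_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the most trusted candidate, breaking ties by user score."""
    if not candidates:
        return None
    # Unknown source types rank below every known one.
    return max(candidates, key=lambda c: (SOURCE_RANK.get(c.source, 0), c.score))


if __name__ == "__main__":
    options = [
        Candidate("hola mundo", "machine"),
        Candidate("hola, mundo", "volunteer", score=4.5),
    ]
    print(pick_best(options).text)
```

A real system would probably weigh more signals than this (the age of a translation, the reputation of the translator, the site owner's policies), but the basic idea of ranking candidates from several sources stays the same.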
Our vision at WWL is to make human/machine translation an embedded service, part of the collection of open services and protocols that comprise the Internet. This will happen within 2 to 3 years. The Firefox addon is a working example of what this will look like to ordinary users. Once it is made faster and its functionality is embedded in systems like web servers, it will become part of the standard web services "stack", in the parlance of software engineers. When we reach that point, which could arrive sooner than most people realize, the web will be transparent to most languages. The multilingual web will be what's known as a "best effort" system. Billions of people will use these services, which will call out to retrieve the best available professional, volunteer and machine translations on demand, but they will be mostly invisible to their users. They will become part of the web, and soon a user will be able to open a page and read it in their own language without needing to do anything or know how this is being done.
Building the Multilingual Web: Four Easy Pieces
So how, specifically, do we build the multilingual web? The idea of eliminating the language barrier for every user and website sounds ambitious, but by breaking this into several smaller tasks, we see that not only is this possible, but that all of the pieces required to build this already exist. To make this a reality, we simply need to improve on what has already been built, and to embed it in as many different systems and services as possible. Then, in just a few years, billions of people will be using these tools, most of them without realizing it.
The multilingual web will consist of four pieces which work together to make the entire web translatable:
- Web browser extensions
- Web server extensions
- Global translation memory
- Language service providers
Web Browser Extensions
A multilingual web browser will look and act just like today's browser. The only difference is that it has been extended to include translation features. From a user's viewpoint, this is an invisible tool. The user opens a web page, and if it is in a foreign language, the translation software will activate itself and call a human/machine translation server to request the best available human or machine translations, and then redraw the web page with the translations.
Our Firefox Translator provides a good example of what this will look like. It automatically detects if a page needs to be translated, and if so, calls WWL and machine translation services to request the best available translations. Users can edit and score translations in the browser, simply by mousing over a translation to display a popup editor which saves the edits back to the translation memory, where they become available to other users. It also provides numerous options for displaying and color coding translations by type and quality.
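The browser-side flow described above (detect whether a page needs translating, request translations, redraw the page) can be sketched roughly as follows. A real browser extension would be written in JavaScript and would call a translation server; this Python sketch only illustrates the logic. The function names and the `lookup` callback are hypothetical stand-ins, not the Firefox Translator's actual interface.

```python
from typing import Callable, List, Optional

# A lookup takes (text, source_lang, target_lang) and returns a translation,
# or None if nothing is available. Here it stands in for the call a real
# extension would make to a translation server.
Lookup = Callable[[str, str, str], Optional[str]]


def needs_translation(page_lang: str, user_langs: List[str]) -> bool:
    """A page needs translating only if the user cannot read its language."""
    return page_lang not in user_langs


def translate_page(
    segments: List[str], page_lang: str, user_langs: List[str], lookup: Lookup
) -> List[str]:
    """Return the page's text segments, translated where possible."""
    if not user_langs or not needs_translation(page_lang, user_langs):
        return list(segments)
    target = user_langs[0]  # assume the first entry is the preferred language
    # Best effort: keep the original segment when no translation exists.
    return [lookup(s, page_lang, target) or s for s in segments]


if __name__ == "__main__":
    known = {("bonjour", "fr", "en"): "hello"}
    print(translate_page(["bonjour", "monde"], "fr", ["en"],
                         lambda t, s, d: known.get((t, s, d))))
```

Falling back to the original text for untranslated segments is what makes this a "best effort" approach: the page always renders, even when only part of it can be translated.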
The only thing we need to do now is to embed this technology in the browser, so it is built into every copy of Firefox and other popular browsers. Until that happens, users can download addons, such as the ones we are developing for Firefox (and soon for other popular web browsers). This piece already exists today, but because users must take the extra step of finding and installing an addon, it is still unknown to most people. It just needs to be improved slightly, to make it faster, and to be embedded in popular browsers by default.
Web Server Extensions
By embedding translation software within popular web servers, such as the Apache server, we can make human/machine translation part of the web services "stack", or LAMP, as it's known in the parlance of software engineers. This will enable a webmaster to install a module, edit a configuration file, and translate an entire website on the fly. Any application or documents hosted on that web server will be translated as they are sent over the wire, again using the best combination of human and machine translations, per the owner's policies. (Embedding WWL-type functionality in proxy servers is another way to do this.)
Anyone who visits that site, even if they are using an older web browser or mobile browser without translation built in, will see the pages translated into their language. The web server software also uses this best effort approach to translate pages using the best available human and machine translations as the documents are being transmitted to users.
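As a rough illustration of translating pages "as they are sent over the wire", here is a sketch written as Python WSGI middleware rather than as an Apache module. It reads the visitor's Accept-Language header and passes text responses through a translation function before they leave the server. The class name, the `translate` callback and the UTF-8 assumption are all hypothetical; a production version would also need to parse HTML so that only visible text is translated, and to respect the header's quality values.

```python
from typing import Callable


def parse_accept_language(header: str, default: str = "en") -> str:
    """Return the primary language of the first tag in Accept-Language.

    Simplification: takes the first tag listed and ignores q-values.
    """
    first = header.split(",")[0].split(";")[0].strip()
    return first.split("-")[0].lower() or default


class TranslationMiddleware:
    """WSGI middleware that translates text responses on the way out."""

    def __init__(self, app, site_lang: str,
                 translate: Callable[[str, str, str], str]):
        self.app = app
        self.site_lang = site_lang
        self.translate = translate  # hypothetical: (text, src, dst) -> text

    def __call__(self, environ, start_response):
        target = parse_accept_language(
            environ.get("HTTP_ACCEPT_LANGUAGE", ""), default=self.site_lang
        )
        captured = {}

        def capture(status, headers, exc_info=None):
            # Hold the status and headers until the body has been rewritten.
            captured["status"] = status
            captured["headers"] = list(headers)

        result = self.app(environ, capture)
        try:
            body = b"".join(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        headers = captured["headers"]
        content_type = {k.lower(): v for k, v in headers}.get("content-type", "")
        if target != self.site_lang and content_type.startswith("text/"):
            text = body.decode("utf-8")  # assumes the site serves UTF-8
            body = self.translate(text, self.site_lang, target).encode("utf-8")
            # The body length changed, so Content-Length must be recomputed.
            headers = [(k, v) for k, v in headers if k.lower() != "content-length"]
            headers.append(("Content-Length", str(len(body))))

        start_response(captured["status"], headers)
        return [body]
```

Buffering the whole response keeps the sketch short, but it gives up streaming; a module designed for speed would more likely translate the document in chunks as it is transmitted.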
Some of this already exists, in the form of a simple web API, on which you can build server-based translation scripts that can call out to request human and machine translations as needed. This works pretty well, and we are developing a high performance library called TransKit that will enable web server and web application developers to embed human/machine translation in almost any system. We hope to release an experimental version of this in October, along with an Apache module as a reference implementation. This module is being written in C and is designed for speed so that real-time translation does not noticeably affect page load times and other measures of performance. This will also be open source, so that developers can embed this library in a wide range of web servers, web applications and embedded systems.
Global Translation Memory
The third key piece is a global, web scale translation memory. A translation memory is a database of texts, their translations, and a revision history of translations to various languages. It records volunteer and professional translations from users and paid translators. The translation memory is accessed by web servers, browsers and other applications that need translation services via a simple web API. This component already exists, and is in production use. The Worldwide Lexicon translation server, based on open source software written in Python, and hosted on Google's grid computing platform App Engine, serves as a global translation memory. Developers can use the public translation memory, available at www.worldwidelexicon.org, or can download the source code and deploy their own instance (for a private in-company translation memory for example). Organizations such as TAUS are also developing shared public translation memories and APIs.
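For readers who think in code, a translation memory as described here (texts, their translations, and a revision history per language pair) can be sketched as a small in-memory data structure. This is only an illustration under my own assumptions: the class and field names are hypothetical, the WWL server's actual schema is not shown in this essay, and a real deployment would use a persistent datastore rather than a dictionary.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Revision:
    text: str    # the translated text
    author: str  # who submitted it
    kind: str    # "volunteer", "professional" or "machine"


Key = Tuple[str, str, str]  # (source_lang, target_lang, hash of source text)


class TranslationMemory:
    """Maps a source text and language pair to its list of revisions."""

    def __init__(self) -> None:
        self._store: Dict[Key, List[Revision]] = {}

    @staticmethod
    def _key(source_text: str, src: str, dst: str) -> Key:
        # Hash the trimmed text so long passages make compact keys.
        digest = hashlib.sha256(source_text.strip().encode("utf-8")).hexdigest()
        return (src, dst, digest)

    def submit(self, source_text: str, src: str, dst: str, rev: Revision) -> None:
        """Append a revision; earlier revisions are kept as history."""
        self._store.setdefault(self._key(source_text, src, dst), []).append(rev)

    def history(self, source_text: str, src: str, dst: str) -> List[Revision]:
        """Return all revisions, oldest first."""
        return list(self._store.get(self._key(source_text, src, dst), []))

    def latest(self, source_text: str, src: str, dst: str) -> Optional[Revision]:
        """Return the most recent revision, or None if there is none."""
        revisions = self.history(source_text, src, dst)
        return revisions[-1] if revisions else None
```

Keeping every revision, rather than overwriting, is what lets users edit and score translations without losing earlier work.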
Our vision for the WWL translation memory is for it to be a global, open content corpus of translations. The system functions as a network of translation servers, so translations submitted to one translation server can be automatically shared with a global network of translation memories. Over time this will grow to encompass billions of texts and their translations. This is important because the translation memory, like Wikipedia, will be open content. Most translation memories today are proprietary systems that are hidden behind corporate firewalls. Translation corpora are difficult and expensive to create, and so most corporations are reluctant to share theirs. WWL is an open system, and will provide translation users, and researchers, with a large and growing source of high quality translations.
Language Service Providers
Language service providers offer real-time or on demand translation, and are an important component of this ecosystem. LSPs fall into two broad categories: machine translation services, and professional translation service bureaus. Machine translation services such as Babelfish, Google Translate, Apertium and Moses enable users to quickly obtain approximate translations to and from about 50 languages (enough to cover more than 95% of the Internet population). While these translations often contain errors, users can generally understand the source material with some effort. Professional translation services offer the option to pay for professional translators. Some of them have built highly automated web interfaces that enable systems like WWL to request professional translations on the fly. ProZ.com (one of the services we have integrated with) has created something similar to Amazon's Mechanical Turk service, where you can request a translation for a block of text via a quick web API call, and then receive the translation, often within minutes.
While there is no uniform standard (a protocol, in engineering jargon) for communicating with these services, they are all relatively easy to communicate with, and therefore to incorporate into software such as WWL, browser translators, etc., in their current state. So, this component of the multilingual web also exists and is already at a pretty mature stage of technological and market development.
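Because each service speaks its own dialect, software that uses several of them typically hides the differences behind a common interface. The sketch below shows one way this could look in Python. The `Provider` interface and both implementations are hypothetical stand-ins: `GlossaryProvider` imitates a machine translation service with a word-for-word lookup, and `HumanJobProvider` imitates a professional service that only has an answer once a translator has delivered it. Neither represents any real service's API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class Provider(ABC):
    """Common interface placed in front of each language service."""

    @abstractmethod
    def translate(self, text: str, src: str, dst: str) -> Optional[str]:
        """Return a translation, or None if this provider has none."""


class GlossaryProvider(Provider):
    """Stand-in for a machine translation service: word-for-word lookup."""

    def __init__(self, glossary: Dict[Tuple[str, str], Dict[str, str]]):
        self.glossary = glossary

    def translate(self, text: str, src: str, dst: str) -> Optional[str]:
        table = self.glossary.get((src, dst))
        if table is None:
            return None  # unsupported language pair
        return " ".join(table.get(word, word) for word in text.split())


class HumanJobProvider(Provider):
    """Stand-in for a professional service: answers only completed jobs."""

    def __init__(self) -> None:
        self.completed: Dict[Tuple[str, str, str], str] = {}

    def deliver(self, text: str, src: str, dst: str, translation: str) -> None:
        """Record a finished job, as a translator's delivery would."""
        self.completed[(text, src, dst)] = translation

    def translate(self, text: str, src: str, dst: str) -> Optional[str]:
        return self.completed.get((text, src, dst))


def first_available(
    providers: List[Provider], text: str, src: str, dst: str
) -> Optional[str]:
    """Try providers in order of preference; return the first result."""
    for provider in providers:
        result = provider.translate(text, src, dst)
        if result is not None:
            return result
    return None
```

Listing the human provider first means a professional translation is used as soon as it has been delivered, with the machine result filling the gap until then.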
The End of the Language Barrier?
Two years ago, I predicted that the language barrier would cease to exist on the web by around 2010, as people began using embedded translation tools, and began editing translations en masse. The foundation for the multilingual web now exists, and what remains to be done is mostly a matter of improving on the tools already built, making them work together, and embedding them in other systems so that this technology becomes widely accessible. While that is a big goal and a big prediction, we are on the verge of that today. The Firefox Translator, just now entering its 1.0 release as an open source Firefox extension, offers a preview of what should become a standard feature in all web browsers in some form in the upcoming years.
Numerous online communities have emerged in recent years that focus on translation, and have proven that volunteer-based models can work very well. Among the leaders are Global Voices/Lingua, which translates blogs around the world; Meadan, which translates English/Arabic news and commentary; and YeeYan, which translates English news and commentary into Chinese. Wikipedia, meanwhile, has built a huge translation community that actively creates and translates content in dozens of languages.
I believe we are entering a transition where the language barrier will fade away rather quickly for web users. While it won't happen instantly, and the web won't be translated perfectly, accessing foreign language websites and services will become effortless and automatic, and, as more people get involved in editing and correcting translations, the quality will improve. Already this is reality for thousands of early adopters, and as these tools and others like them become ubiquitous, the web as a whole will become transparent across languages.
When that happens, billions of people will be using the multilingual web, although the underlying technology will be, like other Internet infrastructure, invisible and free.
The Polyglot Internet, by Ethan Zuckerman of Global Voices