A team of computer scientists at the Johns Hopkins Center for Language and Speech Processing recently received a $10.7 million federal grant to develop technology to translate obscure languages. Currently, well-established translation interfaces like Google Translate exist only for the 100 most commonly used languages.
The grant was awarded by the Office of the Director of National Intelligence (DNI), which oversees the U.S. intelligence community.
Philipp Koehn, a professor of computer science, will lead the research group. He has been working in machine translation for over 20 years.
“These are definitely the languages we haven’t ever built anything for,” Koehn said. “These are languages where Google Translate doesn’t necessarily have systems for because they haven’t bothered.”
The languages covered by the grant, which include Kurdish, Serbo-Croatian, Khmer, Hmong and Somali, are known as “low resource” languages. Speakers of these languages are not widely distributed across the world. “Low resource” languages also have little written material.
In an effort to expand its intelligence resources, the DNI is seeking to translate “low resource” languages more quickly to aid national security efforts, as intelligence agents often need to translate material in the languages encompassed by the grant.
Koehn added that these translation systems could also be used in an emergency, when people might need to aid regions that they cannot easily communicate with.
“If something happened in these countries… you type in a query, it looks at all of the relevant documents in that foreign language and gives you an English summary,” he said.
One of the main challenges of the project is building translation systems for languages that are not typically used in writing.
“There’s very little translated text or transcribed speech, so it’s harder to build anything for those languages,” he said. “That’s the hard part of our project.”
Koehn elaborated that the lack of data makes it more difficult for machine learning systems to develop translations, since many of the languages the group plans to work with have very little translated text or transcribed speech.
However, once the data is obtained, the group will be able to develop translations using algorithms that analyze the structure, inflection and other elements of the language. The techniques for developing translations are already well established and simply need to be applied to new languages. The researchers do not need to analyze speeches themselves in order to translate them.
After obtaining data for Swahili and Tagalog last week, Koehn was able to build a translation system in less than a day.
Although machine translation now features in various devices, websites and social media, Koehn recalled that many people were skeptical of it in its early days.
“I did my PhD on [machine translation], and it’s interesting to see something that, 20 years ago, definitely did not work,” he said. “Now we’re really at a point where it’s there.”
However, Koehn acknowledged that computer translation still has its limitations, and that there are nuances in language that the algorithms cannot always account for.
For this particular project, Koehn said that the majority of the money would go towards funding new PhD students and could help support as many as 11 PhD candidates and postdoctoral fellows.
“In computer science and engineering, all the PhD students get a scholarship, and that costs a lot of money,” Koehn said.
As Koehn described, the project involves many researchers: the group consists of around 20 professors and other researchers.
Daniel Povey, an assistant research professor, is a member of the research team. He described the project as an opportunity to expand and test different technologies — including speech, translation and information retrieval technologies — that the group has been developing.
“We’re integrating different complex technologies… and they have to work in real-world conditions that might not match the data they were trained on,” Povey wrote in an email to The News-Letter.
Another member of the team, Kevin Duh, an assistant research professor, is enthusiastic about the project.
“Besides the emphasis on research in low-resource languages, the project’s holistic approach… will help us bring research to practice,” Duh wrote in an email to The News-Letter.