Language is at the centre of the AI boom, and that essentially means English (with a dash of French) is being prioritised. Minority languages, particularly in Africa, are in danger of being left behind. Journalist, iHubOnline Founder and general nice guy Mallick Mnela has become an ambassador for AI in Malawi. And he is increasingly passionate about inserting African languages into these monstrous, English-dominated LLMs.
I wanted to know from Mallick whether this problem of language representation in AI is solvable, or whether the resources needed to make a difference are simply too massive.
Mallick has been in journalism for 20 years, but quit his job in 2019 to be an entrepreneur. I met him in October last year in Namibia and he was inhaling every AI development, app and piece of code he could find. In Malawi, he says, the majority of the population is digitally excluded. “Most indigenous knowledge is not properly documented and even less is published online,” he says. Still, a few people in Malawi are pushing the AI sector forward.
He’s busy working with Microsoft to make the Malawian language Chichewa better represented in the AI ecosystem. He built a Chichewa version of ChatGPT and bundled it with “reinforcement learning”. This is where a user sends feedback if they aren’t satisfied with the answer, and a human then checks the interaction and makes changes where necessary. It effectively crowdsources the tweaks needed to improve the model.
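To make that feedback loop concrete, here is a minimal sketch (not Mallick’s actual code) of how user corrections might be logged and queued for human review before being folded back into training data. The file name, field names and functions are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone

FEEDBACK_FILE = "chichewa_feedback.jsonl"  # hypothetical storage location

def log_feedback(prompt, model_answer, user_correction, rating):
    """Record a user's correction so a human reviewer can check it later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "model_answer": model_answer,
        "user_correction": user_correction,
        "rating": rating,           # e.g. 1 (poor) to 5 (good)
        "review_status": "pending"  # a human flips this to "approved" or "rejected"
    }
    with open(FEEDBACK_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def approved_examples():
    """Yield human-approved corrections, ready to become fine-tuning pairs."""
    with open(FEEDBACK_FILE, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["review_status"] == "approved":
                yield {"prompt": record["prompt"], "completion": record["user_correction"]}
```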
Mallick found that religious Chichewa content was readily available online (and there were naturally plenty of translations to stack it next to). So he used this to train his model, and it meant that a bias manifested: the LLM used more “biblical” language. But it also made the LLM more tolerant and philosophical, he says.
The big problem is that if we don’t start training these models on niche languages now, we aren’t going to be able to develop them as easily in the future as the tech accelerates. “For English, the LLMs perform well in handling complicated language queries. But they can be pretentious and give false information that appears real,” says Mallick. To stop this from happening, Retrieval-Augmented Generation (RAG) is used. This gives the LLM context and lowers the chance of it spewing nonsense. However, if the language is low-resourced (like Chichewa) then you can’t augment it in this way. “The model just produces chaos when you try,” says Mallick.
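For readers who want to see the idea, here is a bare-bones sketch of the RAG pattern: retrieve the most relevant passages from a trusted corpus and paste them into the prompt, so the model answers from that context rather than from memory. The retrieval here is a crude word-overlap score and `call_llm` is a placeholder, not any particular API.

```python
def word_overlap(query, passage):
    """Crude relevance score: how many query words appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query, corpus, k=3):
    """Return the k passages from the corpus that best match the query."""
    return sorted(corpus, key=lambda passage: word_overlap(query, passage), reverse=True)[:k]

def answer_with_rag(query, corpus, call_llm):
    """Build a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(retrieve(query, corpus))
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # call_llm is whatever model endpoint you use
```

The catch Mallick describes is the first step: if there is barely any trustworthy Chichewa text to retrieve from, the context is thin and the grounding does not work.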
And lastly, who should pay for all this?
“Media companies may not have the money but they have the data. I therefore strongly propose a collaborative approach,” says Mallick. “The best idea is to forge partnerships so that the media can provide the much needed datasets while big tech focuses on the technology.”
If a news outlet has been publishing bilingual news daily for years, it already has an incredibly rich dataset: masses of the sentence pairs needed to train a custom model. And such a model has far-reaching applications (beyond media) that could create a revenue stream for a newsroom.
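As a rough illustration, turning a bilingual archive into training data can be as simple as pairing the aligned sentences and writing them out in a standard format. The file names and column names below are assumptions, not any specific newsroom’s archive.

```python
import csv
import json

def build_sentence_pairs(archive_csv, output_jsonl):
    """Convert a bilingual CSV export (one aligned sentence pair per row,
    with 'english' and 'chichewa' columns) into JSONL training pairs."""
    with open(archive_csv, newline="", encoding="utf-8") as src, \
         open(output_jsonl, "w", encoding="utf-8") as out:
        for row in csv.DictReader(src):
            source = row["english"].strip()
            target = row["chichewa"].strip()
            if source and target:  # skip incomplete rows
                out.write(json.dumps({"source": source, "target": target},
                                     ensure_ascii=False) + "\n")

# Example usage (hypothetical file names):
# build_sentence_pairs("archive_export.csv", "chichewa_pairs.jsonl")
```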
“We need to start looking at how our newsrooms are structured,” says Mallick. “We need to locate the tech savvy journalists and empower them to provide a bridge between newsrooms and tech spaces.”
As Mallick points out, the big tech companies need to bring as many languages as possible along with them as this is their future user base. However, they are going to do this with or without the cooperation of the media and content creators. “As long as journalists focus on generating content, AI companies will be scraping it from their platforms with little or no acknowledgement at all,” says Mallick. It is better for the media to be actively involved, generate cash and produce a far better product.
This week’s AI tool for people to use…
I demonstrated this tool last week during my class on “Data Visualisation and AI” for The British University in Egypt. Google Looker Studio (previously known as Google Data Studio) is the Google Docs of data visualisation. It integrates with all your other Google tools and lets you quickly turn your data into a cool-looking graph or dashboard.
What AI was used in creating this newsletter?
I used ChatGPT for the image above. It took over a dozen tries to remove African beads from the picture. They would either be on the man’s neck or wrists. Even when ChatGPT said that it had removed the beads they were still in the picture. And the AI kept complaining that it couldn’t see or review what it had produced… like it was caught in a kind of creative torture.
In the news…
The bad: AI and jobs. The Hard Fork podcast, with Kevin Roose and Casey Newton, has that insufferable chirpy tone that is a characteristic of podcasts from The New York Times, but if you can get through that then there are some gems in this episode. They take a solid look at what AI is doing to jobs and the economy in the US.
The good: people are being paid. A new report from Reuters shows how tech giants are racing to secure vast quantities of online data to feed their AI models. And they are actually paying this time. Meta, Google, Amazon, and Apple all reached a deal with Shutterstock in 2022. The agreement included hundreds of millions of images, videos and music files for AI training. Prices for training data range from a few cents per image to hundreds of dollars per hour of video. If you are a data-heavy platform, it is time to cash out.
What is happening at Develop Audio?
Our sister company does investigative podcast production and training. We’re excited to share with you their new podcast series, Asylum (it is available on the Alibi feed, as season 3, on Apple and Spotify).
Investigative journalist Opoka p'Arop Otto has asylum in The Netherlands. He was forced to flee his country of South Sudan because he was in danger of being killed. The incredible journalism he produced about the situation in his country meant he could no longer live there. He is now rebuilding his life in Europe, forced to work as a cleaner in order to make money for his family... and yet he is still helping other journalists back in South Sudan as they fight to survive. A huge thanks goes to The Pulitzer Center for supporting this series. (Those links again: Spotify and Apple).
What is happening at Develop AI?
I was very happy to see the announcement of the finalists for the Digital Media Awards by The World Association of News Publishers (WAN-IFRA). I served as a judge for these awards in the “Best Use of AI in the Newsroom” and “Best Podcast” categories. Congratulations to all the people and organisations that are up for awards!
See you next week. All the best,
Develop AI is an innovative company that reports on AI, builds AI-focused projects and provides training on how to use AI responsibly.
Check out Develop AI’s press and conference appearances.
Listen to Burn It Down, our completely AI generated podcast (and ask us to make you one of your own).
Also, look at our training workshops (and see how your team could benefit from being trained in using AI).
This newsletter is syndicated to millions of people on the Daily Maverick.
Email me directly on paul@developai.co.za. Or find me on our WhatsApp Community.
Follow us on TikTok / LinkedIn / X / Instagram. Or visit the website.
Physically we are based in Cape Town, South Africa.
If you aren’t subscribed to this newsletter, click here.