Is ChatGPT the future, and is it a future that we can trust?

Is ChatGPT the only option?  What about ModernMT for translations?

ModernMT is an AI translation program that is available for download on GitHub: https://github.com/modernmt/modernmt

I have been looking at different translation programs that use AI: Google Translate, Reverso, and ChatGPT.

For a common translation pair, say English to Spanish or Spanish to English, there is enough data out there that has been checked and verified that you can generally trust the results ChatGPT and the others provide.

But when you are using a less popular pair, for example English to Hebrew or Hebrew to English, the results are not what you would expect: there are errors that even a beginner can spot.
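One cheap way to spot this, even if you do not read Hebrew yourself, is a round-trip check: translate English to Hebrew, translate the result back to English, and compare the two. Here is a minimal sketch; the translate() function is a placeholder for whichever service you are testing, not a real API.

```python
# Round-trip check: English -> Hebrew -> English.  Large drift between the
# original sentence and the round-tripped one is a cheap warning sign that
# something went wrong in at least one direction.
from difflib import SequenceMatcher

def translate(text: str, source: str, target: str) -> str:
    """Placeholder: plug in Google Translate, ChatGPT, ModernMT, etc."""
    raise NotImplementedError("connect this to the service you want to test")

def round_trip_score(sentence: str) -> float:
    hebrew = translate(sentence, source="en", target="he")
    back = translate(hebrew, source="he", target="en")
    # 1.0 means the round trip reproduced the sentence exactly.
    return SequenceMatcher(None, sentence.lower(), back.lower()).ratio()
```

A low score does not prove the translation is wrong, and a high score does not prove it is right, but it is a quick way to flag sentences that deserve a human look.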

Essentially, it becomes a case of the “blind leading the blind”.  A big part of the reason is that most machine translation systems draw heavily on the same public training data set, OPUS (https://opus.nlpl.eu/).

The problem with Hebrew-English is that almost all of the English–Hebrew data in OPUS is English-to-Hebrew translation that has never been checked by a human.  So it becomes errors stacked on top of errors, and the end result is a mess.  Or, to use the technical term computer science professors use, “garbage in, garbage out”: if the training data is crap, whatever comes out of it is also going to be crap.
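If you want to see what this training data actually looks like, you can pull a sample of an OPUS-derived English–Hebrew corpus and read a few pairs yourself. A minimal sketch, assuming the Hugging Face datasets library and an opus100 dataset with an en-he configuration (both names are assumptions; check the dataset hub for what is actually available):

```python
# Spot-check English-Hebrew sentence pairs from an OPUS-derived corpus.
# Assumption: the "opus100" dataset exposes an "en-he" configuration whose
# records look like {"translation": {"en": ..., "he": ...}}.
from datasets import load_dataset

dataset = load_dataset("opus100", "en-he", split="train")

# Print the first few pairs so a human reviewer can judge the quality.
for record in dataset.select(range(5)):
    pair = record["translation"]
    print("EN:", pair["en"])
    print("HE:", pair["he"])
    print("---")
```

Even a few minutes of reading pairs like this makes it obvious whether the corpus was ever touched by a human reviewer.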

ModernMT is an open-source project that lets you keep your training data in a directory on your own computer, so you are training the engine on example texts that actually reflect what you are trying to do.  ModernMT also has an online service where you can start from ModernMT’s own training data and then add your data on top of it.  Which approach makes sense depends on your project and on whether the OPUS-style baseline is a good starting point for it.
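The core idea is simple: keep your own parallel sentences in a directory and train only on those. Here is a minimal sketch of what that looks like; the directory layout (corpus/en/*.txt paired with corpus/he/*.txt) and the load_corpus helper are illustrative assumptions for this sketch, not ModernMT’s actual API.

```python
# Illustrative only: pair up source/target files from a local directory so a
# domain-specific engine trains on text you actually trust.
from pathlib import Path

def load_corpus(root: str, src_lang: str = "en", tgt_lang: str = "he"):
    """Yield (source, target) sentence pairs from matching plain-text files."""
    root_path = Path(root)
    for src_file in sorted((root_path / src_lang).glob("*.txt")):
        tgt_file = root_path / tgt_lang / src_file.name
        if not tgt_file.exists():
            continue  # skip files without a translated counterpart
        src_lines = src_file.read_text(encoding="utf-8").splitlines()
        tgt_lines = tgt_file.read_text(encoding="utf-8").splitlines()
        # Only keep files where the line counts match, i.e. aligned sentences.
        if len(src_lines) == len(tgt_lines):
            yield from zip(src_lines, tgt_lines)

pairs = list(load_corpus("corpus"))
print(f"{len(pairs)} aligned sentence pairs ready for training")
```

The point is not this specific code; it is the control. Every sentence pair the engine sees is one you chose to put there.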

In my testing of English to Hebrew and Hebrew to English, several problems showed up.

Problem 1: an AI system has no way of knowing whether an article is talking about Biblical Hebrew or Modern Hebrew.  A lot of Christian websites like to use Biblical passages as their Hebrew text examples.  That has its positives and negatives, but for AI training it causes problems: the article is discussing Modern Hebrew while the examples are in Biblical Hebrew.

Not to mention that there are numerous ways to spell the same word: with nikud, without nikud, with full (“long”) spelling, with short spelling, with full nikud, or with only the nikud that affects pronunciation.  A human can tell that all of those are the same word, but unless an AI system is specifically told so, it has no idea that is the case.
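One common way to deal with the nikud part of the problem is to normalize the text before training or comparing, stripping the nikud and cantillation marks so the spelling variants collapse onto a single form. A minimal sketch:

```python
# Strip Hebrew nikud and cantillation marks so that spelling variants of the
# same word compare as equal.  These marks are Unicode combining marks
# (category "Mn"), so NFD-decomposing and dropping that category removes them.
import unicodedata

def strip_nikud(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# "Shalom" written with nikud and without it normalize to the same string.
print(strip_nikud("שָׁלוֹם") == "שלום")  # True
```

That handles nikud; full versus short spelling is a harder problem, because it changes the letters themselves rather than the marks around them.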

Finally, the data is skewed.  You can have a billion sentences in your dataset, but if they are the “wrong” sentences, your data is still garbage.

Here are some language statistics:

How many words are in the English language?  Roughly one million.

  • 171,146 words currently in use in the English language
  • 47,156 obsolete words
  • Native speakers know 15,000 to 20,000 word families (lemmas) in their first language

But look at the words that most people actually use on a regular basis:

  • 220 words cover around 50% of typical text, at a minimum (the Dolch words)
  • 1,000 words cover 80 to 90% of text written at a Grade 8 level or below, which is the level most websites and newspapers are written to (the Fry words)
  • 875 words give roughly 90% coverage of children’s books, and even higher coverage of children’s TV shows and YouTube videos

Here are some other wordlists:

  • 2,800 words cover 87 to 94% of adult reading material (NGSL)
  • An additional 1,200 TOEIC words push coverage to about 99% (TSL)
  • 960 additional academic words (NAWL)
  • 1,700 business words (BWL)
  • 600 fitness words (FWL)

There are also specialty lists of middle school and high school vocabulary by subject:

  • English
  • Math
  • Science
  • History
  • Geography
  • Art
  • Music

The bottom line is that a vocabulary of around 5,000 words gives a person a strong foundation, and at around 10,000 words you are approaching native-speaker levels.
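These coverage figures are easy to sanity-check: tokenize a text, count how many of its running words fall inside a given word list, and divide. A minimal sketch (the ten-word list and the sample sentence are placeholders; a real check would load the full Dolch or Fry list):

```python
# Rough word-list coverage check: what fraction of the running words in a
# text are drawn from a given high-frequency word list?
import re

def coverage(text: str, word_list: set[str]) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in word_list)
    return covered / len(tokens)

# Placeholder list; a real check would load the full Dolch or Fry list.
high_frequency = {"the", "a", "and", "to", "of", "in", "it", "is", "you", "that"}
sample = "The dog ran to the park and the girl followed it."
print(f"Coverage: {coverage(sample, high_frequency):.0%}")
```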

When an AI program just grabs data indiscriminately, it can have a data set of over a billion sentences and still never see sentences with the proper use of “girl” and “puppy” (two very basic examples).  The results that come back are then either so badly wrong that anyone can recognize the mistake, or, more likely, close but still not right.  The second case is worse, because those mistakes get missed, and people end up trusting a system that is not at a level where it deserves to be trusted.

Can AI be trusted for non-translation purposes?

There is no simple answer.  Where is the AI getting its data from?  Can you trust those sources?  Is it even telling you the truth, or is the AI making things up?

Where does ChatGPT get its data from?

This is the answer that I got back from ChatGPT:

ChatGPT, as an AI language model, has been trained on a diverse range of data sources from the internet. It has been trained on a mixture of licensed data, data created by human trainers, and publicly available text from sources such as books, websites, and other texts, to develop a broad understanding of human language.

It does not have direct access to specific databases or the ability to browse the internet. It is important to note that while ChatGPT strives to provide accurate and up-to-date information, it may not always reflect the most current news or developments, as its training data goes up until September 2021 and it does not have access to real-time information.

Here are my questions:

  1. “A diverse range of data sources from the internet”: the internet is a huge place.  That could mean anything from Encyclopedia Britannica (a source most people would trust) to the comments on this very article, data that nobody would ever trust.
  2. “Trained on a mixture of licensed data, data created by human trainers, and publicly available text”: again, licensed data could be Encyclopedia Britannica or the National Enquirer.  Everybody trusts Encyclopedia Britannica, but unless you are working for the “Men in Black”, most people do not trust the National Enquirer.
  3. “Books, websites, and other texts”: books should mean books in the public domain, but recently published, still-copyrighted books have a way of ending up in training data as well.  Websites can be anything from the CDC website to a child’s “My Favorite Chemistry” page.  “Other texts”?  Again, that could be Wikipedia (with all of its flaws) or the comments on a blog.

ChatGPT 4 (which people have to pay to use) does have access to real-time information.  But the free version that most people use is ChatGPT 3.5, which only has data up to September 2021.

But at the end of the day, people blindly accept whatever ChatGPT returns when, in reality, they have zero idea where exactly ChatGPT is getting its data.  When you want a recipe for a full pan of chocolate chip cookies, that may be fine; your 6-year-old does not care where the recipe came from.  But when you are talking about the recipe for making superglue, you want some assurance that the results are actually correct.