February 28, 2023

Google Bard AI: Everything that You Must Know!

What is Google Bard? What did Google Bard do? What question did Bard get wrong? What questions did Google AI get wrong? What is the Bard chatbot? Which sites was it trained on?

These are a few of the questions that are circulating in the AI industry.

Google Bard is trained on web content, but how the information is collected and which content is used is something that every user should know.

Let's start answering every question about Google Bard.

What is Google Bard?


Google's Bard is an AI chatbot like ChatGPT. Users can talk to Bard in a conversational way.

Bard is built on LaMDA (Language Model for Dialogue Applications).

Infiniset is the dataset on which Bard's underlying model is trained. So far, very little information has been revealed about Infiniset.

It is too early to say where and how Google collected the data for LaMDA.

The 2022 LaMDA research paper shows that 12.5% of the data came from Wikipedia and 12.5% from a public dataset (C4).

Even though Google has not revealed where it collected the rest of the data, there are some sites that industry experts point to.

What is Google's Infiniset Dataset?

Google Bard is based on the Language Model for Dialogue Applications, also known as LaMDA.

Google trained this language model on the Infiniset dataset.

Infiniset is a blend of data collected to improve LaMDA's ability to hold a conversation.

The LaMDA research paper (https://arxiv.org/pdf/2201.08239.pdf) shows that the whole process focused on improving dialog.

LaMDA was pre-trained on 1.56 trillion words of public data.

The research paper reveals that the data comes from the following sources:

  • 12.5% is C4-based data.
  • 12.5% is from Wikipedia.
  • 12.5% is code documents from programming-related sites such as Q&A sites and tutorials.
  • 6.25% is English web documents.
  • 6.25% is non-English web documents.
  • 50% is dialogs data from public forums.

We already know that 25% of the data comes from Wikipedia and C4.

The C4 dataset is derived from Common Crawl.

It also means that the remaining 75% of the Infiniset data comes from other content on the internet.
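
The arithmetic behind these shares can be checked with a short Python sketch. It assumes, for illustration only, that the published percentages apply to the 1.56 trillion word total. The source does not break the mix down by word count, so the per-source figures are rough estimates. The names `INFINISET_MIX`, `known_share`, and `estimated_words` are hypothetical.

```python
from fractions import Fraction

TOTAL_WORDS = 1_560_000_000_000  # 1.56 trillion words of pre-training data

# Shares of the pre-training mix as listed in this article (hypothetical name).
INFINISET_MIX = {
    "dialogs from public forums": Fraction(50, 100),
    "C4": Fraction(125, 1000),
    "Wikipedia": Fraction(125, 1000),
    "code documents": Fraction(125, 1000),
    "English web documents": Fraction(625, 10000),
    "non-English web documents": Fraction(625, 10000),
}


def known_share(mix):
    """Share that comes from the two clearly identified sources."""
    return mix["C4"] + mix["Wikipedia"]


def estimated_words(mix, total=TOTAL_WORDS):
    """Rough words per source, ASSUMING the shares apply to the word total."""
    return {name: int(share * total) for name, share in mix.items()}


if __name__ == "__main__":
    # The listed shares should add up to the whole dataset.
    assert sum(INFINISET_MIX.values()) == 1
    known = known_share(INFINISET_MIX)
    print(f"Wikipedia + C4: {float(known):.0%}")
    print(f"Everything else: {float(1 - known):.0%}")
    for name, words in estimated_words(INFINISET_MIX).items():
        print(f"{name}: ~{words / 1e9:.1f} billion words")
```

Exact fractions are used instead of floats so that the check that all shares add up to 100% cannot fail because of rounding.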

What the research paper does not reveal is how the data was collected.

Google has not explained what it means by "Non-English web documents."

That is why the remaining 75% of the data is described as murky.

C4 Dataset:

Google developed the C4 dataset in 2020.

All the data used in C4 comes from open-source Common Crawl data.

What is the Common Crawl?

Common Crawl is a non-profit organization that crawls the web and publishes free datasets that anyone can use.

The Common Crawl founders are from Blekko, Wikimedia, and Google.

How did Google develop the C4 dataset from Common Crawl?

The company cleaned the Common Crawl data by deduplicating it and removing thin content, lorem ipsum placeholder text, obscene words, navigational menus, etc.

C4 keeps only the main content of each page and removes meaningless content.
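
The kind of cleaning described above can be sketched in Python. This is a simplified illustration and not Google's actual pipeline: the blocklist, the three-word threshold, and the function names `clean_page` and `clean_corpus` are all assumptions made for the example.

```python
import re

# Placeholder blocklist; the real word list used for C4 is not reproduced here.
BLOCKED_WORDS = {"badword"}
MIN_WORDS_PER_LINE = 3  # illustrative threshold, an assumption


def clean_page(text):
    """Return the cleaned page text, or None if the page should be dropped."""
    lowered = text.lower()
    if "lorem ipsum" in lowered:
        return None  # page contains placeholder text
    if set(re.findall(r"[a-z']+", lowered)) & BLOCKED_WORDS:
        return None  # page contains a blocked word
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Drop very short lines, which are often menus or buttons.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Keep only lines that end like a sentence.
        if not line.endswith((".", "!", "?", '"')):
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None


def clean_corpus(pages):
    """Clean every page and drop exact duplicates of the cleaned text."""
    seen = set()
    result = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None or cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

For example, two pages that differ only in their navigation lines collapse into one entry, because the menus are stripped before the duplicate check runs.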

But that does not mean unfiltered versions of the C4 dataset are unavailable.

Here is the C4 dataset research paper:

A second research paper shows that 32% of Hispanic and 42% of African-American pages were removed during filtering.

51.3% of the data comes from sites hosted in the United States.

The C4 dataset draws on sites such as the following:

  • www.npr.org
  • www.ncbi.nlm.nih.gov
  • caselaw.findlaw.com
  • www.kickstarter.com
  • www.theatlantic.com
  • link.springer.com
  • www.booking.com
  • www.chicagotribune.com
  • www.aljazeera.com
  • www.businessinsider.com
  • www.frontiersin.org
  • ipfs.io
  • www.fool.com
  • www.washingtonpost.com
  • patents.com
  • www.scribd.com
  • journals.plos.org
  • www.forbes.com
  • www.huffpost.com
  • patents.google.com
  • www.nytimes.com
  • www.latimes.com
  • www.theguardian.com
  • en.m.wikipedia.org
  • en.wikipedia.org

Top-level domain extensions used in the C4 dataset are:

  • .com
  • .org
  • .co.uk
  • .net
  • .com.au
  • .edu
  • .ca
  • .info
  • .org.uk
  • .in
  • .gov
  • .eu
  • .de
  • .tk
  • .co
  • .co.za
  • .us
  • .ie
  • .co.nz
  • .ac.uk
  • .ru
  • .nl
  • .io
  • .me
  • .it
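
A ranking like the one above could be produced by tallying the domain extension of every page URL in the dataset. The Python sketch below shows one way to do that. It is an illustration under assumptions: the function names are hypothetical, the URLs must include a scheme such as https://, and the small hand-picked set of two-part extensions stands in for the full Public Suffix List that a real tally would need.

```python
from collections import Counter
from urllib.parse import urlparse

# Two-part extensions taken from the list above. A real tally would
# consult the full Public Suffix List instead of this short set.
TWO_PART_SUFFIXES = {"co.uk", "com.au", "org.uk", "co.za", "co.nz", "ac.uk"}


def domain_suffix(url):
    """Return the extension of a URL's host, e.g. 'org' or 'co.uk'."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 2 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-2:])
    return labels[-1]


def count_suffixes(urls):
    """Count how many URLs fall under each domain extension."""
    return Counter(domain_suffix(url) for url in urls)


if __name__ == "__main__":
    sample = [
        "https://www.npr.org/",
        "https://en.wikipedia.org/wiki/Main_Page",
        "https://www.example.co.uk/news",  # made-up URL for illustration
    ]
    for suffix, pages in count_suffixes(sample).most_common():
        print(f".{suffix}: {pages}")
```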

Here is what was published in the 2021 research paper: https://arxiv.org/pdf/2104.08758.pdf

What is Dialogs Data from Public Forums?

Google's LaMDA uses 50% of data from "Dialogs Data from Public Forums."

It is reasonable to assume that communities like StackOverflow and Reddit are included, since they are used in many datasets.

Google has also talked about MassiveWeb. You should know that MassiveWeb is a dataset created by DeepMind, which is owned by Google.

MassiveWeb uses data from:

  • StackOverflow
  • Reddit
  • Medium
  • Facebook
  • YouTube
  • Quora

But no one can say for certain whether this data is used for LaMDA.

Remaining Data:

The remaining data is from:

  • 6.25% from non-English web documents.
  • 6.25% from English web documents.
  • 12.5% from Wikipedia.
  • 12.5% from code document sites.

What did Google Bard do?

Google has launched Bard to compete with ChatGPT, the chatbot from Microsoft-backed OpenAI.

But most recently, Bard gave a wrong answer during its search demo. This error wiped about $100 billion off the market value of Alphabet.


Google Bard is Google's effort to compete with ChatGPT and other AI chatbot technologies. The recent Bard demo caused a massive fall in the shares of Alphabet, Google's parent company.

It also shows that a lot of work still needs to be done to fix errors and make Bard ready for the future.

There will be more news coming out soon about Bard.


Man Behind eAskme

Gaurav Kumar

Gaurav Kumar is the founder of eAskme.com. He is a professional blogger, writer, and motivational speaker. He is the man behind the "Blogging for money guide" and the "complete domain name guide". eAskme will help you to become an online entrepreneur. You can learn SEO, money making, blogging, and more.
