Creating an AI (step 1: NLP)

Mark Lyck
4 min read · Nov 12, 2016

A little pre-story:

I recently set out to create a chatbot from scratch.

There are plenty of chatbots out there, like Siri, Cortana, and Alexa. But these all have one major thing in common: they are geared towards being “assistants”. Thus, they pretty much only handle question-and-answer chat and commands, like “turn on the light” or “what’s the weather like?”. Useful for sure, but not very “human-like”.

It’s been a dream of mine for quite some time to create a chatbot geared towards emotions and human interaction. But you have to start somewhere.

The first of many major goals is simply to qualify for the Loebner Prize.

To qualify, the bot must easily handle the following question types:

My name is Bill. What is your name?
How many letters are there in the name Bill?
How many letters are there in my name?
Which is larger, an apple or a watermelon?
How much is 3 + 2?
How much is three plus two?
What is my name?
If John is taller than Mary, who is the shorter?
If it were 3:15 AM now, what time would it be in 60 minutes?
My friend John likes to fish for trout. What does John like to fish for?
What number comes after seventeen?
What is the name of my friend who fishes for trout?
What would I use to put a nail into a wall?
What is the 3rd letter in the alphabet?
What time is it now?

However, before we can start building an advanced emotional AI, we have to start at step 1: natural language processing. A chatbot isn’t much use if it can’t understand what we’re trying to say to it!

Cleaning up your messy language!

Before we start analyzing the text, it helps a lot to clean and normalize it. The process itself is simple, but the word lists it needs are time-consuming to build. Luckily the internet exists, and amazing people have already created text files for these things, such as:

  • translating British and Canadian English into American English
  • correcting over 4600 commonly misspelled words
  • translating texting contractions such as ‘cuz’ to ‘because’
  • normalizing the text by translating many synonyms and sayings into a consistent form
  • converting number words such as ‘seventeen’ to ‘17’

I decided to create an NPM package out of this, called chatbot-normalizer.
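To make the idea concrete, here is a minimal sketch of lookup-based normalization. It is not the chatbot-normalizer code or its API: the `REPLACEMENTS` table and the `normalize` function are hypothetical names, and a real normalizer would load thousands of entries from the text files mentioned above.

```javascript
// Hypothetical lookup table, a tiny stand-in for the real word lists.
const REPLACEMENTS = new Map([
  ["colour", "color"],     // British -> American spelling
  ["recieve", "receive"],  // common misspelling
  ["cuz", "because"],      // texting contraction
  ["seventeen", "17"],     // number word -> digits
]);

// Replace each known word with its normalized form; leave everything else untouched.
function normalize(text) {
  return text.replace(/[A-Za-z']+/g, (word) => {
    const replacement = REPLACEMENTS.get(word.toLowerCase());
    return replacement === undefined ? word : replacement;
  });
}

console.log(normalize("What number comes after seventeen?"));
```

A single word-by-word pass like this cannot handle multi-word sayings, so a fuller version would probably need phrase-level patterns as well.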

Once we have a normalized sentence, I burst out the punctuation, separating it from the adjacent words. This makes it easier to deal with the words in the message.

"My name is Bill." becomes: "My name is Bill ."
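One way to do this is with a single regular expression. The sketch below is an assumption about how it could work, not the bot's actual code; `burstPunctuation` and `tokenize` are hypothetical names. It only splits punctuation that is followed by whitespace or the end of the text, so a time like 3:15 stays in one piece.

```javascript
// Put a space before punctuation that ends a word, so it becomes its own token.
// Punctuation inside a token (such as the colon in "3:15") is left alone.
function burstPunctuation(sentence) {
  return sentence.replace(/([.,!?;:]+)(?=\s|$)/g, " $1");
}

// Split the burst sentence into tokens on whitespace.
function tokenize(sentence) {
  return burstPunctuation(sentence).split(/\s+/).filter(Boolean);
}

console.log(burstPunctuation("My name is Bill.")); // My name is Bill .
console.log(tokenize("My name is Bill. What is your name?"));
```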

Then I tag all the words with a part-of-speech tagger.

taggedWords:
[
  ['My', 'PRP$'],  // Possessive pronoun
  ['name', 'NN'],  // Noun, singular
  ['is', 'VBZ'],   // Verb, 3rd person singular present
  ['Bill', 'NNP']  // Proper noun, singular
]

I also sort the words into broader category arrays:

names: ["Bill"],
nouns: ["name"],
numbers: [],
verbs: ["is"],
adverbs: [],
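The grouping can be sketched as a mapping from Penn Treebank tags to the broader categories. The `TAG_GROUPS` table and `groupByTag` function are hypothetical, and which tags belong in which group is my assumption rather than something the bot's code confirms.

```javascript
// Assumed mapping from broad category to Penn Treebank tags.
const TAG_GROUPS = {
  names: ["NNP", "NNPS"],
  nouns: ["NN", "NNS"],
  numbers: ["CD"],
  verbs: ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
  adverbs: ["RB", "RBR", "RBS"],
};

// Collect the words of each broad category from [word, tag] pairs.
function groupByTag(taggedWords) {
  const groups = {};
  for (const [name, tags] of Object.entries(TAG_GROUPS)) {
    groups[name] = taggedWords
      .filter(([, tag]) => tags.includes(tag))
      .map(([word]) => word);
  }
  return groups;
}

const taggedWords = [["My", "PRP$"], ["name", "NN"], ["is", "VBZ"], ["Bill", "NNP"]];
console.log(groupByTag(taggedWords));
```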

We should also figure out whether a question was asked and, if so, what kind. For this I would like to write a machine learning algorithm and train it on TREC’s dataset, which contains classifications for over 4000 questions.

To start off with, however, I’ll be using the Qtypes library to classify questions. For anyone interested, here is the full list of question classes.

qType tells us what kind of reply the user expects. qSubType tells us what the question format is, which is one of CH, WH, YN, and TG:

  • CH: Choice or alternative questions ask you to pick between two or more things. For example: Is the water hot or cold?
  • WH: WH questions are the most common. They begin with who, what, where, when, or why.
  • YN: Yes/no questions are fairly self-explanatory. For example: Do you have a pencil?
  • TG: Tag questions are not really questions, but ways of keeping the conversation open. They are statements that end with a short positive or negative tag, usually containing a pronoun. For example: It’s beautiful, isn’t it? or Sally went to the store, didn’t she?
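To show how these four formats differ, here is a rough rule-based guess at the format of a question. This is not how Qtypes works internally; `guessQuestionFormat` and both word lists are hypothetical, and the rules are deliberately crude.

```javascript
const WH_WORDS = ["who", "whom", "whose", "what", "which", "where", "when", "why", "how"];
const AUXILIARIES = ["is", "are", "was", "were", "am", "do", "does", "did",
  "can", "could", "will", "would", "should", "have", "has", "had"];

// Guess the question format: "TG", "WH", "CH", "YN", or null when no rule matches.
function guessQuestionFormat(question) {
  const text = question.trim().toLowerCase();
  const words = text.replace(/[?.!,]/g, " ").split(/\s+/).filter(Boolean);
  if (words.length === 0) return null;
  // Tag question: a statement, a comma, then an auxiliary plus a pronoun.
  if (/,\s*(\w+n[’']t|\w+)\s+(i|you|he|she|it|we|they)\s*\?$/.test(text)) return "TG";
  if (WH_WORDS.includes(words[0])) return "WH";
  // A leading auxiliary means yes/no, unless "or" offers alternatives.
  if (AUXILIARIES.includes(words[0])) return words.includes("or") ? "CH" : "YN";
  return null;
}

console.log(guessQuestionFormat("Is the water hot or cold?"));
console.log(guessQuestionFormat("Sally went to the store, didn't she?"));
```

Rules like these break quickly on real input (for example, "Which is larger, an apple or a watermelon?" opens with a WH word but offers a choice), which is one reason to lean on a library or a trained classifier instead.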

Let’s look at a small example output from this:

>> What time did the train leave London?
isQuestion: true
qtype: 'NUM:date'
qSubType: 'WH'

We now know that a question was asked and that it expects an answer in the form of a date. Our part-of-speech tagger tells us that it’s in the past tense, and gives us the noun “train”, the verb “leave”, and the name “London”.

From this we can almost guess what the user is trying to ask us. But there’s still plenty more that can be done.

I also implemented a units parser, which finds units in text.

>> How many steps am I from Colorado?
units: ['steps']
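A units parser can start out as a simple dictionary lookup. The sketch below is an illustration, not the parser used in the bot; `KNOWN_UNITS`, `IRREGULAR_PLURALS`, and `findUnits` are hypothetical, and the unit list is far from complete.

```javascript
// Hypothetical, deliberately small list of unit names in singular form.
const KNOWN_UNITS = new Set(["step", "meter", "mile", "kilometer", "foot",
  "second", "minute", "hour", "kilogram", "pound"]);
const IRREGULAR_PLURALS = new Map([["feet", "foot"]]);

// Return every word in the text that names a known unit, as it was written.
function findUnits(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter((word) => {
    // Reduce plurals to singular: irregular forms first, then a trailing "s".
    const singular = IRREGULAR_PLURALS.get(word) || word.replace(/s$/, "");
    return KNOWN_UNITS.has(word) || KNOWN_UNITS.has(singular);
  });
}

console.log(findUnits("How many steps am I from Colorado?")); // [ 'steps' ]
```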

I pull out dates and times:

>> Remind me to pay rent at 4:10pm tomorrow
dates: [November 12, 2016],
times: ['16:10']
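The time half of this can be sketched with a regular expression that finds 12-hour clock times and converts them to 24-hour strings. `extractTimes` is a hypothetical helper, not the bot's actual parser; it does not validate hours or minutes, and resolving a relative date like "tomorrow" is left out entirely.

```javascript
// Find 12-hour clock times such as "4:10pm" or "8 am" and return them as "HH:MM".
function extractTimes(text) {
  const pattern = /\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b/gi;
  const times = [];
  for (const match of text.matchAll(pattern)) {
    let hour = Number(match[1]) % 12;          // 12 am becomes 0; 12 pm becomes 0, then 12 below
    if (match[3].toLowerCase() === "pm") hour += 12;
    const minutes = match[2] || "00";          // no minutes given means on the hour
    times.push(String(hour).padStart(2, "0") + ":" + minutes);
  }
  return times;
}

console.log(extractTimes("Remind me to pay rent at 4:10pm tomorrow")); // [ '16:10' ]
```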

I find out whether it’s a command or not.

>> Set an alarm for 8 am.
isCommand: true
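One simple heuristic for this is that commands tend to open with an imperative verb. The sketch below assumes a small hand-picked verb list; `COMMAND_VERBS` and `isCommand` are hypothetical and not taken from the bot's code. Checking whether the part-of-speech tagger marks the first word as a base-form verb would be a more general alternative.

```javascript
// Hypothetical list of verbs that typically open a command.
const COMMAND_VERBS = new Set(["set", "turn", "remind", "play", "stop",
  "open", "call", "send", "show", "tell"]);

// Treat the message as a command when it starts with a known imperative verb,
// optionally preceded by "please".
function isCommand(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  if (words[0] === "please") words.shift();
  return words.length > 0 && COMMAND_VERBS.has(words[0]);
}

console.log(isCommand("Set an alarm for 8 am."));  // true
console.log(isCommand("What time is it now?"));    // false
```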

After this, I decided to move on to a reasoning system and work on modules to handle simple questions. But there are still plenty of improvements that can be made to the natural language processing.

The next thing I implement will most likely be an emotion evaluator, to figure out how your message ranks on emotional levels such as joy, anger, fear, and sadness. This will be a strong factor in making the bot’s replies feel more human-like later on.
