What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of Artificial Intelligence and Computer Science that focuses on programming computers to understand, interpret, and process large volumes of human (natural) language. Human language is complex and ambiguous, which makes it difficult for computers to understand. Multiple techniques exist to process such language, ranging from rule-based algorithmic approaches to statistical methods and machine learning.
NLP can be applied in various ways; a few well-known applications are sentiment analysis, chatbots, text summarisation, and machine translation. Processing text is difficult for machines: the same word can mean entirely different things depending on context, while different words can share the same meaning. We can define a few linguistic features that enable NLP algorithms to make more accurate predictions about semantics.
Tokenisation
Tokenisation is the process of segmenting text into words, sub-words, punctuation marks, etc. These smaller segments are called tokens and are usually considered the building blocks of natural language. Each language has its own segmentation rules; for example, “U.S.A.” should be kept as a single token, whereas punctuation at the end of a sentence should be split off into a token of its own.
Consider the following sentence: “Let’s go to N.Y.!”
Tokenising this sentence yields the following tokens:
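As an illustration, here is a minimal sketch of tokenisation using the spaCy library (our choice of tool for this example; the text itself does not prescribe one). It assumes spaCy and its small English model en_core_web_sm are installed.

```python
# Minimal tokenisation sketch using spaCy (assumed tooling; install with:
#   pip install spacy && python -m spacy download en_core_web_sm)
import spacy

# Load the small English pipeline, which includes a rule-based tokeniser.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Let's go to N.Y.!")

# Each token is a segment produced by language-specific rules:
# the abbreviation "N.Y." stays together as one token, while the
# contraction "'s" and the exclamation mark are split off.
for token in doc:
    print(token.text)
```

Running this prints the tokens Let, ’s, go, to, N.Y., and ! on separate lines: the abbreviation is kept intact, while the contraction and the trailing punctuation each become a token of their own.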