Matching resumes with job offers using spaCy, a Natural Language Processing (NLP) library in Python
Recruitment agencies easily have to match hundreds of candidates with hundreds of job listings a month. It is a very time-consuming process to manually sift through such a pile of documents. There has to be a way to do this more efficiently, right? Well, luckily there exists a multitude of NLP open-source libraries for Python that can help us speed up this process and shortlist several candidates for these job listings.

What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of Artificial Intelligence/Computer Science that focuses on how computers can be programmed to understand, interpret and process huge volumes of human (natural) language. Human language is quite complex and ambiguous, making it difficult for computers to understand. Multiple techniques exist to process such language, from algorithmic approaches to statistical methods and machine learning.

NLP can be used in various ways; a few well-known applications are sentiment analysis, chatbots, and text summarisation and translation. Processing text can be quite difficult for machines: similar words can mean totally different things, and different words can share the same meaning. We can define a few linguistic features that enable NLP algorithms to make more accurate predictions about semantics.

Tokenisation

Tokenisation is the process of segmenting text into words, sub-words, punctuation, etc. These smaller segments are called tokens and are usually considered the building blocks of natural language. Each language has its own rules, e.g. 'U.S.A.' should be treated as a single token, whereas punctuation at the end of a sentence should become a token of its own.

Consider the following sentence: “Let’s go to N.Y.!”

Tokenizing this text would happen as follows:

spaCy tokenisation - https://spacy.io/usage/linguistic-features#tokenization
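A minimal sketch of this step. A blank English pipeline is enough for tokenisation alone, so no trained model needs to be downloaded:

```python
import spacy

# A blank English pipeline carries the tokeniser rules (and its exceptions,
# such as keeping "N.Y." whole) without any trained components.
nlp = spacy.blank("en")
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# → ['Let', "'s", 'go', 'to', 'N.Y.', '!']
```

Note how the contraction is split into two tokens while the abbreviation stays intact.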

Part-of-Speech tagging

Part-of-Speech tagging or grammatical tagging is identifying a word based on its definition and context and classifying it by its grammatical properties (e.g. verb, noun, adjective). Although this seems simple in theory, it can be quite difficult to correctly tag ambiguous words such as 'fly', which can be either a noun (the insect) or a verb (as in 'to fly'). spaCy uses a statistical model that predicts which label or tag most likely applies in the given context.

Lemmatisation

Lemmatisation reduces inflectional and related forms of a word to its common base, also known as a ‘lemma’. This allows for groups of words to be analysed as a single item. For example:

  • am, is, are → be
  • run, ran, running → run

Named Entity Recognition

Named Entity Recognition (NER) is a form of information extraction that locates important information (entities) in text. These entities can vary from a person to an IP address, country or phone number. Nowadays, machine learning models are able to accurately extract these entities from large bodies of text using convolutional neural networks. In this project I will use spaCy’s NER system to extract certain skills from job listings and resume texts.

Resume matching

So, back to how we can use NLP and spaCy to match resumes to tech related job listings. First of all, we would need to be able to read in resumes either in PDF or Word format. Luckily, Python has two libraries, PyPDF2 and textract, to do just that.

Now that we have two functions that can read in resumes in PDF or Word format, these texts need to be tokenised so spaCy can work with them.

This function will read in all PDF or Word files from a specified directory, extract the candidate names from the filenames and tokenise the texts before appending them to a list.
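A sketch of such a function. The `extract_text` argument stands in for a PDF/Word text extractor, and a blank pipeline is used here to keep the sketch light; in practice you would load the full `en_core_web_sm` pipeline:

```python
import os
import spacy

# Swap in spacy.load("en_core_web_sm") when the trained components are needed.
nlp = spacy.blank("en")

def load_resumes(directory, extract_text):
    """Read every PDF/Word file in `directory`, take the candidate name
    from the filename, and return (name, tokenised Doc) pairs.
    `extract_text` maps a file path to its plain text."""
    resumes = []
    for filename in sorted(os.listdir(directory)):
        name, ext = os.path.splitext(filename)
        if ext.lower() not in (".pdf", ".doc", ".docx"):
            continue  # skip anything that is not a resume file
        text = extract_text(os.path.join(directory, filename))
        resumes.append((name, nlp(text)))
    return resumes
```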

Updating the NER system

It is time to update spaCy’s NER system so it can recognise skills. I found this great source that provides a JSONL file with over 2,000 skill patterns, which I imported after slightly modifying the file. spaCy parses the texts, looks for the patterns specified in the file and labels them according to their ‘label’ value. It also comes with a pretty visualiser that shows what the NER system has labelled.
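In spaCy 3 this comes down to adding an EntityRuler and loading the patterns. A minimal sketch with two inline patterns (the original loads the ~2,000 patterns from the JSONL file instead, e.g. via `ruler.from_disk(...)`; the SKILL label here mirrors the pattern file's 'label' values):

```python
import spacy

# Blank pipeline for a self-contained sketch; with a trained model you
# would add the ruler with before="ner" so it takes precedence.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Two inline examples; in practice: ruler.from_disk("skill_patterns.jsonl")
ruler.add_patterns([
    {"label": "SKILL", "pattern": [{"LOWER": "python"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
])

doc = nlp("Experienced in Python and machine learning.")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Python', 'SKILL'), ('machine learning', 'SKILL')]
```

The visualiser mentioned above is `spacy.displacy`; `displacy.render(doc, style="ent")` highlights the labelled spans in the text.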

For the example below I imported an example resume; a screenshot of the NER output follows.

Now that we are able to extract the skills from a body of text, we can do this for multiple resumes and job listings and match them according to their skills. For the following example I took a random Data Science job listing from Indeed, but this data could also be acquired through other means (e.g. web scraping, data files, etc.)

First of all, we create a dictionary with the candidate names as keys and their skills as values. Using Python sets, we can then easily identify the intersection between a candidate’s skills and the required skills extracted from the job listing. An example with six resumes outputs the following:
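The matching step itself is plain set arithmetic. A sketch with made-up candidates and skills:

```python
# Hypothetical skills, as they would come out of the NER step
job_skills = {"python", "sql", "machine learning", "spark"}
candidates = {
    "Alice": {"python", "sql", "excel"},
    "Bob": {"java", "spark", "machine learning", "python"},
}

def match_scores(candidates, required):
    """Fraction of the required skills each candidate covers."""
    return {name: len(skills & required) / len(required)
            for name, skills in candidates.items()}

print(match_scores(candidates, job_skills))
# → {'Alice': 0.5, 'Bob': 0.75}
```

The `&` operator gives the intersection of the two sets, so each score is simply the overlap divided by the number of required skills.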

Visualising these results in a bar graph provides a better overview of the results.

Bar plot NLP resume matcher
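A plot like this can be produced with matplotlib; a sketch with hypothetical scores:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

scores = {"Alice": 0.5, "Bob": 0.75, "Carol": 0.25}  # hypothetical match scores
names = sorted(scores, key=scores.get, reverse=True)  # best match first

plt.bar(names, [scores[n] for n in names])
plt.ylabel("Fraction of required skills matched")
plt.title("Candidate match scores")
plt.tight_layout()
plt.savefig("match_scores.png")
```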

Final Thoughts

The purpose of this article was to show a potential solution, or helping tool, for a problem many recruiters might experience. While this method of parsing and matching resumes is obviously not foolproof, and hard skills are not the only way to evaluate potential candidates, this script may save you a considerable amount of time when shortlisting candidates. A next step in this project could be summarising and extracting key information from resumes, to provide a quick overview of all the candidates.

I hope this post was able to provide you with some insights into what can be achieved with NLP. A link to the full code can be found here or here. If you have any questions or ideas regarding this blog post or other NLP projects, we are happy to hear from you! You can reach out to me through LinkedIn or e-mail.

This post was written in December 2020 by Dennis de Voogt, Data scientist at Always be learning.
