Dataset For Resume NER (Named Entity Recognition)
To reduce the time this process takes, recruiters typically follow one of two approaches:
- They review only the top few resumes from the whole set and select candidates from those.
- They review all resumes but spend less time on each, focusing on only a few sections of every resume.
In both cases, recruiters may miss the right candidates: in the first case, skilled people may be left in the unreviewed set of resumes; in the second, recruiters may not look at every essential field of each resume.
To avoid this issue, we need to automate the process so recruiters can focus on the most important sections in less time. How can we automate resume screening? Using Natural Language Processing (NLP) techniques, we can build a model/system to automate it. To build an accurate and effective model, we need a properly annotated Resume Named Entity Recognition (NER) dataset. You can get such a dataset at Predictly.
Details about the Resume NER dataset:-
- Input Type:- Text
- Number of data points/resumes:- 2500
- Number of entities/labels:- 13
- Labels:- Name, Phone Number, Email, Skills, School Name, College Name, Degree, Major, Location, Year of Passing, Grade, Organization Name, Years of Experience.
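For reference, a single annotated data point in a span-based NER dataset often looks like the following. This is a hypothetical example; the exact schema depends on the annotation tool used.

```python
# Hypothetical example of one annotated resume in a common span-based NER
# format: character offsets into the raw text plus an entity label.
sample = {
    "text": "John Doe | john@example.com | Python, NLP | B.Tech, 2019",
    "entities": [
        {"start": 0,  "end": 8,  "label": "Name"},             # "John Doe"
        {"start": 11, "end": 27, "label": "Email"},            # "john@example.com"
        {"start": 30, "end": 36, "label": "Skills"},           # "Python"
        {"start": 38, "end": 41, "label": "Skills"},           # "NLP"
        {"start": 44, "end": 50, "label": "Degree"},           # "B.Tech"
        {"start": 52, "end": 56, "label": "Year of Passing"},  # "2019"
    ],
}
```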
Where and How to Collect the Data?
Before creating a Resume NER dataset, we need to know what kinds of data and labels/classes we need. Here, we need a large set of resumes. Nowadays, candidates upload their resumes to company websites, and those resumes are stored in the organizations' databases; we can collect much of this data using web scraping techniques. We can gather the required number of resumes from various companies' job portals. After building a list of resumes, we need to extract the text from each one effectively.
Methods:- Web Scraping, Data Collection, Data Extraction, Data Storage, Data Management, Data Preprocessing.
Technologies/Libraries used:- Python, Pandas, Selenium, BeautifulSoup, Requests, JSON, CSV, PyPDF2, python-docx.
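As a rough illustration of the collection step, the sketch below pulls resume links out of a job-portal listing page with Requests and BeautifulSoup. The link pattern is an assumption about how a portal exposes resumes; real portals differ, and many restrict scraping, so always check a site's terms first.

```python
# Sketch: collect resume download links from a listing page.
# Assumes resumes are exposed as .pdf/.docx links, which varies by portal.
import requests
from bs4 import BeautifulSoup

def extract_resume_links(html: str) -> list[str]:
    """Return hrefs that look like resume files (.pdf/.docx)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith((".pdf", ".docx"))]

def collect_resume_links(listing_url: str) -> list[str]:
    resp = requests.get(listing_url, timeout=30)
    resp.raise_for_status()
    return extract_resume_links(resp.text)
```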
Skills, education details, and job requirements differ from one organization or domain to another. For example, software skill sets are different from healthcare-domain skills. So we need to scrape resumes based on the organization's domain and job requirements.
One more thing: not every job portal or organization shares its resumes with others. So we need to research this and find the best sites for the resumes we require.
Another important task is extracting data from resumes. Resumes come in different formats, such as .pdf and .docx, so we need to apply different extraction techniques depending on the format. We can perform this task in Python using the PyPDF2 and python-docx libraries.
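A minimal sketch of format-dependent extraction, assuming PyPDF2 3.x (with its `PdfReader` API) and python-docx are installed:

```python
# Sketch: extract raw text from a resume based on its file extension.
from pathlib import Path

def extract_resume_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        from PyPDF2 import PdfReader
        reader = PdfReader(path)
        # extract_text() can return None for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document
        doc = Document(path)
        return "\n".join(p.text for p in doc.paragraphs)
    raise ValueError(f"Unsupported resume format: {suffix}")
```

Note that this only covers paragraph text; tables in .docx files need a separate pass over `doc.tables`, and scanned PDFs need OCR.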
After extracting the data, we need to store it for further processing and apply a few preprocessing techniques to improve its quality.
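A small example of the kind of cleanup and storage meant here; the exact steps are project-specific choices, not a fixed recipe:

```python
import json
import re

def preprocess(text: str) -> str:
    text = text.replace("\u00a0", " ")       # non-breaking spaces from PDFs
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse long blank-line runs
    return text.strip()

def store(records: list[dict], path: str) -> None:
    # Store preprocessed resumes as JSON for the annotation stage.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```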
How to Build a Resume NER Dataset?
Methods: Data Labeling, Data Visualization, Model Development, Machine Learning/Deep Learning, Model Evaluation, Word Embeddings(Glove, FastText etc), Active Learning
Technology/Library used: Python, CSV, JSON, Regex, Pytorch, Numpy, TensorBoard, Fast.ai, Scikit-Learn, Matplotlib, Seaborn, and Predictly Text Annotation Platform
Here is how Predictly performs the different tasks needed to create a Resume NER dataset effectively:
- After collecting a sufficient number of resumes from various sources, we need to extract the text from the different resume formats. Candidates present their details in different layouts, such as tables and plain text. We must extract every piece of text from each resume, because every detail matters.
- We need to apply text preprocessing techniques to the extracted data to remove noise (extra spaces, stray punctuation, etc.) and then store the preprocessed data.
- Now the data is ready to be turned into a dataset. First, we predict named entities for 20% of the data using a pre-trained Resume NER model. The results are sent to the annotation team, which checks whether every entity was predicted correctly, labels any text the model missed, and corrects any incorrect predictions.
- We take this 20% of annotated data, train a natural language processing model on it, and apply it to the next 10% of unseen data to predict entities.
- This 10% is again sent to the annotation team for cross-checking, then combined with the previous 20%, and a new model is trained on the total 30% of annotated data. We repeat this process until 100% of the data is annotated.
- After the annotation process is complete, we build an application such as resume screening to address real-world HR problems. For this, we use the state-of-the-art BERT model to train the resume screening model and predict the required entities. We also use Elasticsearch to fetch the desired number of resumes based on the filter skills and related skills.
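The bootstrapping loop in the steps above can be sketched as follows. `pretrained_model`, `train_ner`, and `human_review` are hypothetical stand-ins for the pre-trained Resume NER model, a training routine, and the annotation team's correction pass.

```python
# Sketch of the iterative (active-learning-style) annotation loop:
# pre-label 20%, have annotators correct it, retrain, pre-label the next
# 10%, correct, merge, and repeat until all data is annotated.

def annotate_iteratively(resumes, pretrained_model, train_ner, human_review,
                         first_share=0.20, step_share=0.10):
    n = len(resumes)
    cut = int(n * first_share)
    # Bootstrap: pre-label the first 20% and have annotators correct it.
    annotated = human_review(pretrained_model.predict(resumes[:cut]))
    while cut < n:
        model = train_ner(annotated)             # retrain on all corrected data
        nxt = min(n, cut + int(n * step_share))  # next 10% of unseen resumes
        batch = model.predict(resumes[cut:nxt])
        annotated += human_review(batch)         # cross-check and merge
        cut = nxt
    return annotated
```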
Where Can We Use the Resume NER Dataset?
Using this Resume NER dataset, we can predict named entities such as personal details (name, email, phone number), education details, skills, etc. from every uploaded resume.
Here’s how a typical resume screening system works:
- Upload one or more resumes and select filters such as a list of required skills, experience, educational qualifications, etc.
- After the resumes are uploaded, we extract the data from each one and apply text preprocessing techniques.
- We run the preprocessed data through our trained Resume NER model to predict named entities such as name, email, skills, and experience from each resume.
- We then query Elasticsearch to fetch the qualified resumes that match our filters.
- The system can also return resumes containing skills related to the filter skills, along with a download link for each qualified resume.
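A sketch of the kind of Elasticsearch query this filtering step might use. The index name (`resumes`) and field names (`skills`, `years_of_experience`) are illustrative assumptions about the mapping, not a documented schema.

```python
def build_resume_query(skills: list[str], min_years: int = 0) -> dict:
    """Bool query: every filter skill must match, plus a minimum
    years-of-experience filter. Field names are assumptions."""
    return {
        "bool": {
            "must": [{"match": {"skills": s}} for s in skills],
            "filter": [
                {"range": {"years_of_experience": {"gte": min_years}}}
            ],
        }
    }

# With the official elasticsearch-py client (assumed setup):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="resumes", query=build_resume_query(["Python", "NLP"], 2))
```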