Predicting Fraudulent Job Postings

Faizan Ali
8 min read · Apr 21, 2024


Fraudulent job postings pose a significant problem in today’s digital age. They’re not just annoying, but can also lead to some serious issues like scams, identity theft, and even financial loss. Plus, they can really put a damper on your job search spirit.

So, what can we do to get better at predicting whether a job posting is genuine or a scam? The answer lies in data science! By using data on job postings, we can gain invaluable insights on the relationships between different factors, like education level and salary ranges and whether the job turned out to be fraudulent or not.

For this purpose, we'll be using the Real/Fake Job Posting Prediction dataset from Kaggle. This dataset contains around 18,000 job descriptions, of which 866 are flagged as fraudulent.

The dataset contains 17 features (detailed below), each consisting of textual, numerical, or binary data. The column that interests us the most is the fraudulent column, in which a value of 0 indicates a real job posting and a value of 1 indicates a fake one.

Research Questions

Before we start analyzing the data, it's important to define what exactly we're looking for. While such rich data has the potential to reveal lots of interesting insights, there are five questions that we'll aim to answer in our Exploratory Data Analysis (EDA).

Research Question 1: Are fraudulent job postings more common in certain locations or departments?

Research Question 2: Are there specific phrases or keywords in job descriptions that are common in fraudulent postings?

Research Question 3: Which industries are the victims of the most fake job postings?

Research Question 4: Are fraudulent jobs less likely to have a company logo?

Research Question 5: Are certain employment types, experience levels, or education levels more common in fraudulent postings?

Data Cleaning & Preprocessing

It’s not uncommon to find a dataset with missing values, and upon quick inspection, we found out that our dataset is no exception. Here’s a visualization of the spread of the NaN values across the columns of the dataset.

Since the department and salary_range columns have a very high number of missing values, we decided to drop them. The drawback is that we lose potentially useful information about the prevalence of fraudulent activity in certain departments or salary ranges. However, removing these columns leaves us with more reliable information from the other variables without cutting off too much of the data.

Interestingly, the number of legitimate job postings outweighs the fraudulent ones by quite a margin.

Quite the disparity, right? There's no way we can draw unbiased conclusions from a dataset with such a high class imbalance. Therefore, our next step was to randomly undersample the legitimate job postings until we had a more balanced dataset.
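The undersampling step can be sketched in a few lines of pandas. This is a minimal version, assuming the dataset is loaded into a DataFrame with the Kaggle schema's `fraudulent` column; `balance_dataset` is a hypothetical helper name, not the notebook's actual code:

```python
import pandas as pd

def balance_dataset(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Randomly undersample the legitimate class down to the fraudulent count."""
    fraud = df[df["fraudulent"] == 1]
    # Sample as many legitimate rows as there are fraudulent ones
    legit = df[df["fraudulent"] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the two classes are interleaved rather than stacked
    return (
        pd.concat([fraud, legit])
        .sample(frac=1, random_state=seed)
        .reset_index(drop=True)
    )
```

Fixing the random seed keeps the subsample reproducible, which matters when comparing models later.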

We also created a function to help us prepare the raw text data for machine learning (ML). It takes in raw text and applies several cleaning steps to make the text easier for a machine learning model to understand. Here’s how the function cleans the data:

  • Removes numbers and converts all text to lowercase.
  • Removes common abbreviations, symbols, web addresses, and email addresses.
  • Replaces newline characters with spaces and removes text within square brackets.
  • Removes punctuation from the text.
  • Normalizes word contractions, such as changing “I’ve” to “I have”.
  • Removes HTML tags from the text.
  • Tokenizes the text (splits it into individual words), removes common words (stop words), and applies stemming (reduces words to their root form).
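The steps above can be sketched as a single function. This is a re-creation, not the notebook's exact code: the real abbreviation list, stop-word set, and stemmer aren't shown here, so this sketch uses a tiny inline stop-word list and a naive suffix-stripping "stemmer" as stand-ins (an NLTK PorterStemmer and stop-word corpus would be typical in practice):

```python
import re
import string

# Stand-in tables; the real pipeline would use much larger ones.
CONTRACTIONS = {"i've": "i have", "don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of", "in", "for"}

def naive_stem(word: str) -> str:
    """Crude stemmer: strip a common suffix if the remainder is long enough."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip web addresses
    text = re.sub(r"\S+@\S+", " ", text)                # strip email addresses
    for contraction, expansion in CONTRACTIONS.items():  # normalize contractions
        text = text.replace(contraction, expansion)
    text = text.replace("\n", " ")                      # newlines -> spaces
    text = re.sub(r"\[.*?\]", " ", text)                # text in square brackets
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize, drop stop words, and stem
    tokens = [naive_stem(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

The order matters: contractions must be expanded before punctuation is stripped, or the apostrophes disappear first and "I've" can no longer be matched.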

Now, with the data neatened up, we can finally move to drawing conclusions from it.

Exploratory Data Analysis (EDA)

EDA is an essential step in the data science process that allows us to understand the data’s structure, identify any anomalies or outliers, and spot patterns or trends. It’s during this stage that we clean our data, dealing with missing or incorrect entries that could skew our predictions.

First of all, it’s important to know what variables are positively or negatively correlated with each other. To achieve this, we visualized the correlation between the numerical columns in the DataFrame using a heatmap.
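A minimal sketch of that step, assuming `df` holds the dataset with the Kaggle schema's binary columns (the actual notebook may select its numeric columns differently); the seaborn plotting call is shown commented since it only matters interactively:

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between the numerical/binary columns."""
    numeric_cols = ["telecommuting", "has_company_logo", "has_questions", "fraudulent"]
    return df[numeric_cols].corr()

# To render the heatmap:
# import seaborn as sns
# import matplotlib.pyplot as plt
# sns.heatmap(correlation_matrix(df), annot=True, cmap="Purples")
# plt.show()
```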

Interestingly, according to the heatmap, most correlations appear weak, as indicated by the dominance of the lighter purple color, suggesting that there's no strong relationship between these pairs of variables. However, the features has_company_logo and has_questions show a noticeable positive correlation of 0.23. This suggests that job postings with a company logo are more likely to include questions in the application process.

Moreover, another interesting finding is that Australia has a significantly higher proportion of fraudulent job postings compared to the other countries.

Let's now have a look at a couple of visualizations on fraudulent job postings from the perspectives of different features:

In a nutshell, fraudulent job postings are most commonly associated with full-time, entry-level roles, mostly requiring a high school or equivalent education level, often in the fields of administration and engineering. Fake postings also tend to lack a company logo and are especially prevalent in the oil & energy industry.

We also managed to extract the top 10 keywords that fake job postings apparently love to use:

Now that all our research questions have been answered, there's one final finding that we believe is worth sharing. According to the below box and whisker plot, fraudulent job postings tend to have a lower character count than legitimate ones. One would've thought that fake recruiters would put in extra effort to make their postings as believable as possible. But apparently, they tend to be more conservative with the amount of time they spend on their keyboards.
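Computing those per-class character counts is straightforward. A minimal sketch, assuming `df` has the dataset's description and fraudulent columns (`char_counts_by_class` is a hypothetical helper; the box plot itself would then be drawn from the two series, e.g. with seaborn):

```python
import pandas as pd

def char_counts_by_class(df: pd.DataFrame):
    """Split description lengths into legitimate and fraudulent series."""
    lengths = df["description"].fillna("").str.len()
    return lengths[df["fraudulent"] == 0], lengths[df["fraudulent"] == 1]
```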

Classification

With the rise of online job portals, it has become increasingly simple for malicious entities to post fake job listings with the intent of scamming some poor, unsuspecting job seekers. Therefore, using existing data, it’s imperative for us to develop robust classifiers that could root out any fake job postings.

Using our limited dataset, we’ve trained a total of 5 models for this classification task.

1. Logistic Regression

A simple, yet powerful algorithm for binary classification, logistic regression is often the first choice for any task involving classification. Naturally, it was our first pick as well. Here are the model results:

It's important to remember that we're working with a balanced dataset, which came at the cost of dropping tens of thousands of entries. A lower accuracy with logistic regression is therefore expected. Using the original (imbalanced) dataset, we were able to obtain a whopping accuracy of 96%, though with such a skewed class distribution, high accuracy can come simply from predicting the majority class.
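A minimal sketch of this baseline, assuming the text is vectorized with TF-IDF (the article doesn't specify the vectorizer, so that's an assumption); `train_logreg`, `texts`, and `labels` are hypothetical names standing in for the notebook's code and the balanced dataset's columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_logreg(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline and report test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```

Stratifying the split preserves the 50/50 class ratio we worked to achieve in the balanced dataset.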

2. Random Forest Classifier

Random Forest is an ensemble method that provides a more robust prediction by combining the predictions of multiple decision trees. Known for its high accuracy and ability to handle imbalanced datasets, this proved to be the perfect fit (no pun intended) for us.

First we used the entropy criterion:

Then we trained the Random Forest classifier using the Gini impurity criterion:

The Random Forest classifier with the Gini impurity criterion gave us a really impressive 90% accuracy.
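The two variants differ only in the `criterion` argument. A minimal sketch, assuming 100 trees and a fixed seed (the notebook's actual hyperparameters aren't shown; `train_forests` is a hypothetical helper, and `X_train`/`y_train` would be the vectorized features and labels from the earlier steps):

```python
from sklearn.ensemble import RandomForestClassifier

def train_forests(X_train, y_train):
    """Fit Random Forests with the entropy and Gini impurity criteria."""
    rf_entropy = RandomForestClassifier(
        criterion="entropy", n_estimators=100, random_state=42
    )
    rf_gini = RandomForestClassifier(
        criterion="gini", n_estimators=100, random_state=42
    )
    rf_entropy.fit(X_train, y_train)
    rf_gini.fit(X_train, y_train)
    return rf_entropy, rf_gini
```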

3. Support Vector Machine Classifier

SVM is yet another powerful and flexible classifier, commonly employed when handling high-dimensional data such as vectorized text, which makes it particularly well suited to this task. Here are the results:
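A minimal sketch of this model, assuming a linear kernel (a common default for sparse text features; the notebook's actual kernel and parameters aren't shown, and `svm_model` is a hypothetical name):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# TF-IDF vectorization followed by a linear support vector classifier
svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())
```

A linear SVM is usually preferred over kernel SVMs here because TF-IDF matrices are already high-dimensional and sparse, where classes tend to be close to linearly separable.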

4. Multinomial Naive Bayes Classifier

Multinomial NB is a probabilistic learning method that is widely used in Natural Language Processing (NLP). The algorithm is based on the Bayes theorem and uses the frequency of words associated with each tag in the training data to predict the tag of a new text, making it a classifier worth trying for a text classification problem like ours.
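Since Multinomial NB models word frequencies directly, it's typically paired with raw counts rather than TF-IDF. A minimal sketch under that assumption (`nb_model` is a hypothetical name, not the notebook's):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Word counts feed the multinomial likelihoods in Naive Bayes
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
```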

5. K-Nearest Neighbors Classifier

For the non-text data, K-Nearest Neighbors (KNN) was used. KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure, making it a good baseline for classification tasks. It's the only model we trained on non-text features.
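A minimal sketch, assuming the non-text features are the dataset's binary/numeric columns (telecommuting, has_company_logo, has_questions). The `StandardScaler` is an addition of this sketch, not necessarily the notebook's setup, included because distance-based models like KNN are sensitive to feature scale:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features, then classify by majority vote of the 5 nearest neighbors
knn_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
```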

Summary

The Random Forest (Gini) model has the highest accuracy and it also shows a good balance between precision, recall, and F1 score, which suggests that it’s the most reliable model among the ones we’ve tested for the task of classifying fake job postings.

You can find the .ipynb notebook with the relevant code and visualization here.
