By: Ralf Bovers
Data quality has become a hot topic once again in the world of people analytics. As more companies come to rely on HR data for strategic decision making, they recognize that this information has to be of good quality. Yet aiming for perfectly clean data is neither possible nor necessary. What is the most realistic way to clean up data so that it is useful? We dig into that here.
One reason that data quality has come into sharper focus lately is sheer volume:
Most companies are collecting an enormous amount of HR data. For example, an organization with 20,000 employees that collects 40 data fields (e.g., name, age, gender, job title) over a span of 36 months has already assembled nearly 29 million data points (20,000 × 40 × 36 = 28.8 million).
In this ocean of data, it’s important to keep a few things in mind. First, while your goal should always be to have high-quality data, understand that data will never be perfect. Luckily, it does not need to be. Even partially clean data can be enough to extract good insights.
How much data is enough? National bureaus of statistics are experts in drawing conclusions from very limited data. Using the principles of mathematical statistics, they are able to give valid insights across a broad range of topics from sample data that represents only 1% of the population. Now imagine that you have 80% data completeness. Even though you are missing 20% of your data points, you should have sufficient accuracy to draw conclusions, provided the missing values are spread fairly evenly and not concentrated in one group, country or business unit.
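The sampling argument can be made concrete with the standard margin-of-error formula for an estimated proportion. The following is a minimal sketch, not a Crunchr method: it assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case p = 0.5. The function name `margin_of_error` and the workforce size are illustrative assumptions.

```python
import math

def margin_of_error(sample_size, population_size, p=0.5, z=1.96):
    """Approximate margin of error for an estimated proportion.

    Assumes a simple random sample and applies the finite
    population correction, since the workforce is a finite group.
    """
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population_size > 1:
        fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    else:
        fpc = 0.0
    return z * standard_error * fpc

# Hypothetical workforce of 20,000 with one field that is 80% complete.
moe_80 = margin_of_error(16_000, 20_000)
# A 1% sample of the same workforce, for comparison.
moe_1 = margin_of_error(200, 20_000)
```

Under these assumptions, the 80% complete field gives a margin of error of roughly 0.35 percentage points, against roughly 7 percentage points for the 1% sample. The formula says nothing about bias: if the missing 20% differs systematically from the rest, the estimate can still be wrong.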
Second, realize that the root causes of dirty data are problems at the collection point: repeat submissions, user error, improper data blending, to name a few. Address these from the beginning.
Lastly, choose your data battles. With millions of data points, be strategic about what information you really need. If your priority is promoting diversity and inclusion, for instance, choose the 3 most important D&I data fields and focus on cleaning them.
At Crunchr, we recommend a 4-step process for cleaning up data:
1. Set global definitions across your organization for data fields. For example, if you use a 4-tiered performance rating system (i.e., excellent, satisfactory, unsatisfactory, rating not available), ensure that all employees are rated against the same scale with the same categories.
Global companies may have different performance rating systems across the world. A common misconception is that they all need to conform to one global rating system. At Crunchr, we use internal dictionaries (‘enumeration tables’) to match and compare these local systems. The benefit is that regions still see their own ratings, while at the global level companies can quickly identify the ‘high performers’, those ready for the next career step. Note that these internal dictionaries also work for grades (where we typically see companies using global and local grades), succession terms, potential assessments, flight risk ratings, etc.
If you notice that a data field has many different values, you might want to consider a Six Sigma analysis to get to the root cause. Six Sigma is a well-known methodology for quality control. Typically, you’ll find that at data entry, the field can be filled with ‘free text’ instead of a value selected from a dropdown. Sometimes you also see that when data is being transferred from one system to another, there are hidden scripts that change the original data.
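An enumeration table of the kind described in step 1 can be sketched as a simple lookup. This is an illustration only, not Crunchr's implementation: the region codes, local rating values and the names `RATING_ENUMERATION`, `to_global_rating` and `unmapped_values` are all hypothetical. Values that have no entry in the table are counted separately, which is one way to surface free-text entries.

```python
from collections import Counter

# Hypothetical enumeration table: (region, local rating) -> global category.
RATING_ENUMERATION = {
    ("NL", "uitstekend"): "excellent",
    ("NL", "voldoende"): "satisfactory",
    ("NL", "onvoldoende"): "unsatisfactory",
    ("US", "5"): "excellent",
    ("US", "4"): "excellent",
    ("US", "3"): "satisfactory",
    ("US", "2"): "unsatisfactory",
    ("US", "1"): "unsatisfactory",
}

def to_global_rating(region, local_rating):
    """Return the global category, or None when the value is not in the table."""
    if local_rating is None:
        return "rating not available"
    key = (region, str(local_rating).strip().lower())
    return RATING_ENUMERATION.get(key)

def unmapped_values(records):
    """Count local values with no entry in the table (often free text)."""
    counts = Counter()
    for region, local_rating in records:
        if to_global_rating(region, local_rating) is None:
            counts[(region, local_rating)] += 1
    return counts

records = [("NL", "Uitstekend"), ("US", 3), ("US", "great job!"), ("NL", None)]
global_ratings = [to_global_rating(region, value) for region, value in records]
free_text = unmapped_values(records)
```

Here the Dutch and US ratings both resolve to the global scale while the stored local value is left untouched, and the entry "great job!" ends up in `free_text` for follow-up at the data entry point.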
2. Look at data completeness. Check to what degree the data fields in your data set are complete. If your policy is that all employees with more than one year in service have a performance score, visualize for whom those data points are missing. Contact their managers to fill in the performance scores in the system of record. When you see important fields left empty, consider making them ‘mandatory’ at data entry.
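A completeness check like the one in step 2 can be sketched in a few lines. This is a minimal example under stated assumptions: records are plain dictionaries, and the field names (`employee_id`, `hire_date`, `performance_score`, `manager`) and the sample data are hypothetical.

```python
from datetime import date

def completeness(records, fields):
    """Share of non-empty values per field, as a fraction between 0 and 1."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result

def missing_performance_score(records, as_of, min_years=1):
    """IDs of employees with more than `min_years` of service and no score."""
    flagged = []
    for r in records:
        years_of_service = (as_of - r["hire_date"]).days / 365.25
        if years_of_service > min_years and r.get("performance_score") in (None, ""):
            flagged.append(r["employee_id"])
    return flagged

employees = [
    {"employee_id": "E1", "hire_date": date(2015, 3, 1),
     "performance_score": "excellent", "manager": "M1"},
    {"employee_id": "E2", "hire_date": date(2019, 6, 1),
     "performance_score": None, "manager": "M1"},
    {"employee_id": "E3", "hire_date": date(2023, 11, 1),
     "performance_score": "", "manager": "M2"},
    {"employee_id": "E4", "hire_date": date(2020, 1, 15),
     "performance_score": "satisfactory", "manager": None},
]
rates = completeness(employees, ["performance_score", "manager"])
to_chase = missing_performance_score(employees, as_of=date(2024, 1, 1))
```

In this sample, `performance_score` is 50% complete and `manager` is 75% complete. Only E2 is flagged for follow-up, because E3 has been in service for less than a year and falls outside the policy.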
3. Spot outliers. If you are analyzing workforce aging, for instance, you may have collected a data set consisting of employee ages and dates of birth. Set up parameters for these fields, like a minimum age and a maximum age (example: minimum 16, maximum 67), and then begin flagging outliers, such as a 99-year-old employee.
Removing these outliers is one part of cleaning your data. Note that it takes an expert to label outliers – not all of them will be obvious, and data that appears to be an outlier may not actually be one.
At Crunchr, we believe that you can use a basic set of outlier detection rules (e.g., salary may never be negative) – but that you will never catch all outliers. And the more rules you make, the more false positives you will get. We advise a smarter approach to improving outlier detection, for example with machine learning.
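A basic rule set of the kind mentioned above might look like the following sketch. The age bounds and the negative-salary rule come from the examples in the text; the record layout and the names `age_on` and `flag_outliers` are assumptions made for illustration. Records are flagged for expert review, not deleted automatically.

```python
from datetime import date

MIN_AGE, MAX_AGE = 16, 67  # bounds from the example; adjust to local policy

def age_on(date_of_birth, as_of):
    """Age in whole years on the `as_of` date."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)

def flag_outliers(record, as_of):
    """Return the names of the rules this record violates (empty if none)."""
    flags = []
    age = age_on(record["date_of_birth"], as_of)
    if not MIN_AGE <= age <= MAX_AGE:
        flags.append("age_out_of_range")
    if record["salary"] < 0:
        flags.append("negative_salary")
    return flags

as_of = date(2024, 1, 1)
staff = [
    {"employee_id": "E1", "date_of_birth": date(1985, 5, 20), "salary": 52000},
    {"employee_id": "E2", "date_of_birth": date(1924, 12, 31), "salary": 48000},
    {"employee_id": "E3", "date_of_birth": date(1990, 2, 2), "salary": -100},
]
all_flags = {s["employee_id"]: flag_outliers(s, as_of) for s in staff}
flagged = {emp_id: flags for emp_id, flags in all_flags.items() if flags}
```

E2 is the 99-year-old from the example and E3 has a negative salary, so both land in `flagged`. Each additional rule widens the net, which is where the false positives mentioned above come from.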
4. Finally, look at combinations of data to spot further outliers and remove them. Take the combination of salaries and job levels, for instance. An outlier would be an employee at a senior vice president level who is earning minimum wage. Or a local employee in China whose salary is in U.S. dollars.
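Combination checks follow the same pattern as single-field rules but compare two or more fields at once. The sketch below covers the two examples from step 4. All reference data here (`EXPECTED_CURRENCY`, `SENIOR_LEVELS`, the `MIN_SENIOR_SALARY` placeholder) and the field names are hypothetical and would need to be replaced with your own pay policy.

```python
# Hypothetical reference data; replace with your own pay policy.
EXPECTED_CURRENCY = {"CN": "CNY", "US": "USD", "NL": "EUR"}
SENIOR_LEVELS = {"SVP", "EVP"}
MIN_SENIOR_SALARY = 100_000  # placeholder floor, in the record's own currency

def combination_flags(record):
    """Return the names of the cross-field rules this record violates."""
    flags = []
    # Senior job level combined with an implausibly low salary.
    if record["job_level"] in SENIOR_LEVELS and record["salary"] < MIN_SENIOR_SALARY:
        flags.append("senior_level_low_salary")
    # Local contract paid in a currency other than the country's own.
    expected = EXPECTED_CURRENCY.get(record["country"])
    if (record["contract_type"] == "local"
            and expected is not None
            and record["currency"] != expected):
        flags.append("currency_country_mismatch")
    return flags

payroll = [
    {"employee_id": "E1", "job_level": "SVP", "salary": 18_000,
     "country": "US", "currency": "USD", "contract_type": "local"},
    {"employee_id": "E2", "job_level": "Manager", "salary": 40_000,
     "country": "CN", "currency": "USD", "contract_type": "local"},
    {"employee_id": "E3", "job_level": "Manager", "salary": 300_000,
     "country": "CN", "currency": "CNY", "contract_type": "local"},
    {"employee_id": "E4", "job_level": "Manager", "salary": 90_000,
     "country": "CN", "currency": "USD", "contract_type": "expat"},
]
review = {p["employee_id"]: combination_flags(p) for p in payroll}
```

E1 and E2 are flagged, while E4 shows why an expert still needs to review the results: an expat in China paid in U.S. dollars may be perfectly valid, so the rule only applies to local contracts.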
As mentioned before, realize that data quality will never be perfect and that you need to pick your battles. Starting from your workforce strategy, think about which metrics are important to track. Determine which items of information are critical and which are optional. In the example above, maybe you just need date of birth but not age: storing both is redundant, and inconsistencies between the two can lead to dirty data.
With cleaned up, reliable data, your organization can begin using people analytics to make bottom-line business decisions. Let’s say you spot a wide range of salaries within the same job title. First, your organization might decide what the appropriate salary range for this job is. Then, it can clearly see how many employees are being underpaid or overpaid – and plan accordingly.
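Once a salary range has been agreed, counting who falls outside it is straightforward. The sketch below assumes a band with inclusive bounds; the function name `classify_against_band`, the band limits and the salaries are illustrative, not real figures.

```python
def classify_against_band(salaries, band_min, band_max):
    """Group employee IDs by position relative to the band (bounds inclusive)."""
    groups = {"below": [], "within": [], "above": []}
    for employee_id, salary in salaries.items():
        if salary < band_min:
            groups["below"].append(employee_id)
        elif salary > band_max:
            groups["above"].append(employee_id)
        else:
            groups["within"].append(employee_id)
    return groups

# Hypothetical salaries for one job title, all in the same currency.
analyst_salaries = {"E1": 41_000, "E2": 55_000, "E3": 72_000, "E4": 60_000}
groups = classify_against_band(analyst_salaries, band_min=45_000, band_max=65_000)
counts = {name: len(ids) for name, ids in groups.items()}
```

With these numbers, one employee sits below the band, two are within it and one is above it. The comparison is only meaningful when the salary and currency fields have already been cleaned.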
Similarly, if you see that women hold 35% of the positions in the top 3 management levels, Crunchr people analytics lets you drill down to their names. That makes it easy to learn more about their career paths and develop similar paths for other women to advance.
Companies focus on collecting a lot of HR data. Yet if that data is not maintained and managed over time, it becomes useless. A good data strategy, therefore, is to ask this: What are the million-dollar questions our company wants to answer with people analytics? Then decide what data fields will answer those questions and plan data strikes. Dedicate the next quarter to improving the quality of just a couple of data fields, and so on.
Your HR data is everywhere, but poor quality continues to be one of the biggest challenges to putting it to good use. Cleaning data is the best way to ensure that people analytics adoption scales in your organization.
A final cautionary tale: About 10 years ago, a major consulting firm was engaged by a global company in the Netherlands to improve data quality. With an army of analysts and tons of hours of data cleansing, they improved the data quality. But the entire team forgot one important action: to prevent data from becoming dirty again. Exactly three years later that data was dirty again, and all the efforts and close to €1 million in consulting fees were wasted.
Crunchr helps organizations around the world gain insights into how their workforce works. We strongly believe that these insights are necessary to navigate today’s business challenges and the future of work.
That is why we empower people analysts, HR and leadership to anticipate trends, design better people strategies and contribute to a healthy working environment.