Welcome to this segment on the basic principles of data quality in clinical research. I confess, data quality isn't a popular topic among researchers. It's like housecleaning: it takes a lot of time, and it can be really tedious. You'd love to pay somebody else to do it, but that costs a lot of money. Not to mention, you don't often see a direct benefit from monitoring your data quality; it doesn't generally lead directly to publications. Sometimes researchers will cut corners, thinking, "I can skip housecleaning if I don't invite anybody over to visit." But sharing your analyses and de-identified data publicly is a growing requirement in clinical research. And trust me, data quality can go wrong in all sorts of fascinating ways.

Here's an overview of what we'll cover in this segment. What does high-quality data mean anyway, and why should we spend any time worrying about it? What are the key aspects of data quality that I should focus on? And how can I design forms that promote high-quality data collection in messy, real-world settings?

Data quality is a more complex and subjective subject than we usually realize. We're interested in the applied concepts here, so let's take a look at a very straightforward example. Here's an invented list of six people named Brown who live in California. But there's a big problem with this list: Alice Brown and Bonnie Brown live in ZIP Codes that don't exist according to the United States Postal Service. That's pretty bad quality, don't you think? But what if I told you I was just interested in looking at the states that people lived in, and the ZIP Codes were completely irrelevant to my analysis? In that case, this might be a perfect data set for my use, don't you think? So if the scale in your clinic is miscalibrated and the study staff have collected hundreds of inaccurate weight values, but you're not using weight in your analysis, then to you the data are fit for use.
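The fitness-for-use idea can be put in code terms: validate only the fields your analysis actually depends on. Here's a minimal sketch; the records, the checks, and the deliberately truncated state list are all invented for illustration, not part of any real validation library.

```python
# A minimal sketch of "fitness for use": validate only the fields the
# analysis actually depends on. Records, checks, and the truncated
# state list are invented for illustration.
records = [
    {"name": "Alice Brown",  "state": "CA", "zip": "00000"},   # nonexistent ZIP
    {"name": "Arnold Brown", "state": "CA", "zip": "94305"},   # valid ZIP
]

US_STATES = {"CA", "NY", "TX"}  # truncated reference list for the sketch

CHECKS = {
    "state": lambda v: v in US_STATES,
    "zip": lambda v: v.isdigit() and len(v) == 5 and v != "00000",
}

def fit_for_use(records, fields_used):
    """Data are fit for a given use if every field that use relies on passes."""
    return all(CHECKS[f](r[f]) for r in records for f in fields_used)

# For a state-only analysis the data are fine; bring ZIPs into the
# analysis and the same data set is no longer fit for use.
print(fit_for_use(records, ["state"]))          # True
print(fit_for_use(records, ["state", "zip"]))   # False
```

The same rows pass or fail depending on which fields the analysis uses, which is exactly the point of the miscalibrated-scale example.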
Arthur Chapman, in his book Principles of Data Quality, presented it this way: in a database, the data have no actual quality or value; they only have potential value, which is realized only when someone uses the data to do something useful. This is why it's really important to look at your data in the context of how you plan to use them.

High-quality data are really important in many fields: banking and accounting, political and military policy, product manufacturing, patient care, and certainly medical research. But nobody can agree on how we should measure this fitness for use. Technically, there's an ISO specification on data quality in the works, ISO 8000, but it's designed for things like supply chains and manufacturing, not clinical research data.

There are many characteristics of data that the literature considers important when you're thinking about fitness for use: completeness, conformity, trust, coverage. This basically means data are of high quality when they are complete, when they conform to specifications, when they are trustworthy, when they have good coverage of the subject, and so on for the rest of the list. You might care about more than one of these, which is why they're called dimensions of data quality. This is a partial list, but there is certainly redundancy in it too. Take a look at these three in blue: accuracy, correctness, and free-of-error. For our purposes, they mean pretty much the same thing, just under different names.

But there are two dimensions of data quality that are particularly relevant to clinical research: completeness and accuracy. Let's talk about completeness first. Going back to our example, imagine the mystery ZIP Codes were blank instead. In this case, we're missing ZIP Codes for two of the six people, so the data set certainly isn't complete. If you find your research data are full of holes too, like a piece of Swiss cheese, then you have a data quality problem.
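That blank-ZIP version of the list lends itself to a simple field-level completeness calculation. A minimal sketch, using invented records with two ZIP codes left blank as in the example:

```python
# A quick completeness check: what fraction of records have a non-blank
# value for each field? The six Brown records are invented, with two
# ZIP codes left blank as in the example.
brown_list = [
    {"name": "Alice Brown",  "state": "CA", "zip": ""},
    {"name": "Bonnie Brown", "state": "CA", "zip": ""},
    {"name": "Arnold Brown", "state": "CA", "zip": "94305"},
    {"name": "Carla Brown",  "state": "CA", "zip": "90210"},
    {"name": "David Brown",  "state": "CA", "zip": "92093"},
    {"name": "Erin Brown",   "state": "CA", "zip": "95616"},
]

def completeness(records, field):
    """Fraction of records with a non-blank value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

print(round(completeness(brown_list, "zip"), 2))  # 0.67 -- two of six blank
print(completeness(brown_list, "state"))          # 1.0
```

Running a check like this per field is one way to see at a glance where the Swiss-cheese holes in a data set actually are.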
Unfortunately, real-world studies are never perfect, and we rarely get a chance to collect every piece of data in exactly the way we planned, so completeness problems are inevitable. There are different types of completeness problems. Try to avoid blank fields in your data set, and instead design your forms to capture why the data are missing. When doing that, you'll need to consider who is providing the information and how it's being entered into your system. The who and the how are different for surveys and for data entry forms, for example, so there are different approaches to reducing incomplete data in each case.

For surveys, your goal is to make it as easy as possible for a respondent to read and answer the questions without skipping anything. That's discussed in the section of this course on survey design. Try to keep an encouraging tone in your survey instructions too, and avoid negative phrases like "do not leave any questions unanswered," as they can negatively affect your respondents. Perhaps you have a patient completing an anonymous survey. In that case, you have one chance to collect the data; because it's anonymous, you won't be able to follow up with them afterward. Make sure to check for blank values before you let them submit the survey.

It's also good practice to offer survey options like "I don't know," "not applicable," or "declined to respond," where appropriate. If you're asking about a person's date of birth, "I don't know" and "decline to respond" might be good options to include, depending on the setting; "not applicable," on the other hand, probably isn't relevant. Consider these cases for every question you design. It's important to realize that it's not always easy to implement this for every question in every survey delivery system. Sometimes you might end up with awkward response options, like the one depicted here, because you can't combine radio buttons and text fields in REDCap.
The options read: "What is your date of birth? Enter value; don't know; decline to answer." And if you select "enter value," then you get a text box below. You'll have to decide whether it's more important to have a shorter, streamlined survey or to reduce missing data. That will depend on the details of your study, and your statistician is a good resource.

The issue with data entry forms is similar: build in a way that allows your users to tell you why a value is blank. Here are some of the categories of missing data often cited in best-practices books. Not applicable: your data entry form has a question about pregnancy, but your patient is male, so not applicable is an appropriate answer. Unknown: your patient did not know his date of birth, so you can record it as unknown. Pending: you're running some lab tests as part of your study; you sent the samples off to the lab, but you haven't received the results back yet. These values should be marked pending, which means they haven't been filled in yet, but you expect they will be. Capturing this type of temporary information is particularly important when you need databases that are current and up to date; remember, that's a dimension of data quality too. Missing is a category that most of us are familiar with. Perhaps you're copying data from a paper case report form into your electronic system; you're expecting a value in the box that's labelled weight, but it's just not there. It's missing. And finally, there's missed, or not done. Perhaps there was a scheduled lab test, but the patient didn't arrive for a study visit. Or the blood pressure cuff was missing from the exam room, so the study nurse couldn't take the measurement. These pieces of data were truly missed: they are not in the database, and they never will be. If you are capturing this information for continuous variables, choose codes that map to impossible values for that variable, not just unlikely values.
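The categories above can be sketched as sentinel codes. The specific code numbers here are invented for illustration (they're not a standard), but they follow the rule just stated: for a continuous variable like weight, pick codes that are impossible values, so a stray sentinel can never be mistaken for a real measurement.

```python
# Illustrative sentinel codes for missing-data reasons. Negative codes
# are impossible weights, so they can never collide with real data.
MISSING_CODES = {
    -991: "not applicable",
    -992: "unknown",
    -993: "pending",
    -994: "missing",
    -995: "not done",
}

def decode_weight_kg(value):
    """Return (weight, reason): a real weight, or None plus why it's absent."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    if value <= 0 or value > 500:  # plausibility window for the sketch
        raise ValueError(f"implausible weight: {value}")
    return value, None

print(decode_weight_kg(72.5))   # (72.5, None)
print(decode_weight_kg(-993))   # (None, 'pending')
```

A code like -99, by contrast, would be merely unlikely rather than impossible for some variables, which is exactly the trap the rule warns against.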
Let's move on to the second component of data quality that I want to focus on, and that is accuracy. Revisiting our example of the six California residents: let's say you went to knock on Arnold's door and found he lives in California, but not in the ZIP Code that's listed in this table. We've already talked about how these data aren't complete, but in this case they aren't accurate either. Maybe he moved a year ago and this list hasn't been updated, in which case the data score low on another dimension of data quality as well: they are not current.

Capturing accurate data is a matter of asking the right questions in the right way. That's why it's important to design and pilot your data entry forms, and to properly validate your surveys. It's also important to make allowances for approximate data. You might not be able to identify the exact date of an adverse event that the patient suffered, especially if you're collecting information through chart abstraction: all you can verify is what's documented, and sometimes the level of detail just doesn't exist. Questions relying on patient recall, like "When did you last sprain your ankle?", "How old were you when you had that surgery?", and "How many cigarettes did you smoke last week?", often produce inexact answers. Capture that vagueness when possible, and don't force the data into artificial categories. This happens most commonly with dates. Sometimes you can only figure out the month or year something happened. For critical date fields, you can design your form so you have a box to enter the date, and then a selection option to define whether the date is exact to the day, exact to the month, or exact only to the year.

And in the end, why should we bother spending all this time making sure our data capture tools are designed to handle data errors? Because low-quality data can skew your study results, especially if it's a systematic error rather than a random one.
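One way to store that date-plus-precision design is a value paired with a precision flag. A minimal sketch; the class and field names are invented for illustration, not a standard:

```python
# Store an approximate date plus its stated precision, instead of
# forcing a vague recollection into a fake exact day.
from dataclasses import dataclass
from datetime import date

@dataclass
class ApproxDate:
    value: date        # convention here: use the 1st for unknown month/day
    precision: str     # "day", "month", or "year"

    def display(self):
        if self.precision == "day":
            return self.value.isoformat()        # e.g. 2019-06-14
        if self.precision == "month":
            return self.value.strftime("%Y-%m")  # e.g. 2019-06
        return self.value.strftime("%Y")         # e.g. 2019

onset = ApproxDate(date(2019, 6, 1), "month")  # patient recalls "June 2019"
print(onset.display())  # 2019-06
```

The analysis can then decide per variable whether month-level precision is good enough, rather than pretending every date is exact.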
We have an ethical responsibility in clinical research to deliver the best-run study possible, because those findings might inform patient care. Here's a graphic you've seen in previous videos, but it's really important in the context of data quality: garbage in, garbage out. Low-quality data is like garbage going into your analysis. Even the most skilled statisticians can't compensate for wildly incomplete and inaccurate data, so plan ahead.

That's it for our data quality overview in this video. We learned that the quality of a data set has to be assessed in the context of how you intend to use the data; that's the concept of fitness for use. We looked at the different characteristics, or dimensions, of data quality, including the two most commonly addressed in research: data completeness and accuracy. And finally, we looked at some strategies to improve data completeness and accuracy in your data forms. If you don't have a data value, capture why you don't have the value instead of leaving the field blank. In the next section, we'll talk about how you can measure data quality. See you then.