So, another good practice to consider when collecting and managing study data is to think ahead about coding your variables for data collection. This becomes important when you get ready to analyze the data later on, so it's very useful to think about it from the front end. A good thought experiment about the importance of this: if we asked a group of 10 people to look around the room and write down on a piece of paper the gender of everyone else in the room, and we collected those slips of paper, we'd probably get something like one individual writing "male" and "female," others writing "M" and "F," maybe somebody writing "males" and "females" and misspelling one of the words. You get into a lot of issues when you deal with unstructured information. So it's a good idea to limit the choices for categorical values. It's also a great idea to go ahead and code the variables, because when you get to the analysis phase of a study, that's what you're going to need for the statistics packages. Here, I think it's very important to be consistent. If you're coding yeses and nos as ones and zeros in one place in your data collection instrument package, you'd better think hard and try, a priori, to set it up the same way for all of the data collection instruments where you're collecting yeses and nos. Otherwise, you get to the end of a study and you have a mess on your hands, and you have to continually think about how things were collected, how they were stored, and how they were coded. So be consistent: similar questions should have similar codes.
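The consistency point can be sketched in code. This is a minimal illustration, not any particular package's API; the names `YES_NO`, `GENDER`, and `encode` are hypothetical. The key idea is that one shared coding scheme is reused for every instrument that asks the same kind of question:

```python
# A hypothetical, shared coding scheme: the same codes are used for every
# yes/no question across all data collection instruments in the study.
YES_NO = {"no": 0, "yes": 1}          # consistent everywhere
GENDER = {"female": 0, "male": 1}     # limited categorical choices

def encode(raw: str, codebook: dict) -> int:
    """Normalize free-form input and map it to its numeric code,
    rejecting anything outside the allowed categories."""
    value = raw.strip().lower()
    if value not in codebook:
        raise ValueError(f"'{raw}' is not an allowed value: {sorted(codebook)}")
    return codebook[value]

print(encode("Yes", YES_NO))     # 1
print(encode("FEMALE", GENDER))  # 0
```

Because entries like "M", "males", or a misspelling are rejected at entry time rather than discovered at analysis time, the unstructured-gender problem from the thought experiment never makes it into the data set.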
It's also very important to realize, particularly in studies that are going to last more than a week, which would be the case with just about anything you're going to do, that you should not rely on memory. So it's important not only to have coded variables and to think about things in this structured way, but also to keep a code book. That way you can use it later on, say 2 or 3 years later when you're ready to analyze your data: you've got everything in one place and a current code book, and there's no ambiguity in terms of what equals what in the coding and the answers. It's also a good idea to think about confounding factors for a study, and this contrasts a little with the "no fishing" example that we mentioned earlier. This is not the same as fishing; any time you're doing a study, there are always going to be a number of confounding factors that you should think about in terms of things that might impact the results. So if you think that race, gender, age, smoking status, weight, or BMI are going to be important when you're analyzing the results and trying to see differences in effect between treatment groups, it's a great idea, in fact an essential idea, to put these into the data collection package, so that you have that information and can move forward in the analysis with those confounding factors available. One example that I know of here is a study that violated our first principle, that it takes a lot of time to get things right before the study. They did what I think was a great study. This is not a study I participated in, but one that a colleague was telling me about.
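A code book doesn't have to be elaborate. Even a plain data structure, kept with the study's documentation, removes the "what equals what" ambiguity years later. Here's a hypothetical sketch (the variable names and entries are illustrative, not from any real study):

```python
# A hypothetical code book: one entry per variable, recording labels,
# codes, and allowed ranges, so nobody has to rely on memory 3 years on.
CODEBOOK = {
    "smoker": {"label": "Current smoker", "codes": {0: "no", 1: "yes"}},
    "gender": {"label": "Gender",         "codes": {0: "female", 1: "male"}},
    "age":    {"label": "Age in years",   "range": (18, 120)},
}

def describe(variable: str, value) -> str:
    """Translate a stored code back into its human-readable meaning."""
    entry = CODEBOOK[variable]
    if "codes" in entry:
        return f"{entry['label']}: {entry['codes'][value]}"
    return f"{entry['label']}: {value}"

print(describe("smoker", 1))  # Current smoker: yes
print(describe("age", 42))    # Age in years: 42
```

Whatever form the code book takes, a file, a table, or a structure like this, the point is that it lives alongside the data and is kept current as codes are added.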
They did this great study, a metabolic study, and the results were quite positive, but the group neglected to collect one of these confounding factors, in this case baseline BMI, before they started the actual study procedures. With that, they really limited their ability to analyze the data and changes from baseline later on. BMI depends on height and weight; while the height might not change over time, the weight certainly would, particularly in a metabolic study. And so there was really no way they could go back and get that measurement they had missed, and it was an important confounding factor. I don't know the outcome of that story. I suspect that the study was strong enough to publish, but it may have been published in a second-tier journal rather than a first-tier journal. So, again, the scientific impact might have been less than it would otherwise have been, had they gone through the good practice of defining all the confounding factors up front. One other thing that I usually bring up in the confounding-factor space is collecting DNA. This depends a little bit on the institution, and I've also found it to depend a little bit on the country in which the research is taking place, so check the applicable laws and policy regulations for your particular institution and location. But one of the things we typically recommend is that individuals go ahead and collect DNA, with consent from patients, of course. That way, if there are confounding factors that turn out to be genetic, either known now or discovered while you're conducting this maybe longitudinal, 2-to-3-year study, you're covered. And maybe there are things that aren't known today.
But they could be known next week, and could have an impact in analyzing the data and being able to tell treatment effects in clusters of people. So whenever possible, we recommend collecting the DNA, even if you're going to save it for analysis later on. But again, I can't stress enough to make sure that you're working through the proper regulatory channels, the human-subject consent process, and local country rules in terms of being able to make that happen. Another good practice, or best practice I think, is to think hard about storing raw data fields rather than calculated fields. A great example here would be the BMI we were just talking about. You could collect the height and the weight and convert those into the BMI. But if you collected and stored only the BMI, there's really no way to go back and decompose that value into height and weight. Hypertension is another class of data we might look at here. I was working on a study a long time ago, a longitudinal 5- or 6-year study, where the study team had collected hypertensive status as either yes or no, some categorical way of collecting that data, rather than storing the systolic and diastolic blood pressure. Unfortunately, during the middle of that study, the categorization rules from the American Heart Association changed. And so they were stuck with this yes/no, collapsed, calculated data around hypertensive status, and they could not go backwards and recalculate to the new rules. So, whenever possible, store the raw data, because you can always calculate the variables after the fact, but you can never go the other way. Don't code continuous variables, things like systolic and diastolic blood pressure, or temperature versus a yes/no fever flag.
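The raw-versus-calculated point can be made concrete with a short sketch. The cutoff values below are illustrative, not the actual AHA definitions; the point is only that a derived category can always be recomputed from stored raw measurements, but not the other way around:

```python
# Why raw fields beat calculated fields: derived values can always be
# recomputed from raw measurements, but never reconstructed from a
# collapsed yes/no. Thresholds here are illustrative, not real guidelines.

def bmi(height_m: float, weight_kg: float) -> float:
    """BMI = weight (kg) / height (m) squared."""
    return weight_kg / height_m ** 2

def hypertensive(systolic: int, diastolic: int,
                 sys_cutoff: int = 140, dia_cutoff: int = 90) -> bool:
    """Categorize from stored raw pressures; if a guideline cutoff
    changes mid-study, just rerun with the new cutoff."""
    return systolic >= sys_cutoff or diastolic >= dia_cutoff

print(round(bmi(1.75, 80.0), 1))               # 26.1
print(hypertensive(138, 85))                   # False under one cutoff
print(hypertensive(138, 85, sys_cutoff=130))   # True under a stricter one
```

Had the study team above stored systolic and diastolic values rather than a yes/no flag, adapting to the new categorization rules would have been a one-line change.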
Again, store them intact, and you can always go back and calculate the collapsed variables after the fact. It's also a good idea to think about missing data. There are a number of reasons why data might show up as missing during the middle of a study, or as you're progressing along with a particular patient, and you should always think about that up front. You should allow for the possibility of missing data in data fields. We recommend never storing a 0 or another possible value as a default in the data collection instruments, because it's difficult, really impossible, to know whether the data were truly missing or whether the field was just left at its default. There are also reasons why data may be missing that you might want to account for when actually collecting the data. A good example here: you ask an individual when they were diagnosed with Parkinson's disease. One individual might say, "I know it very well. It was my son's birthday in 1998, so it would be this exact date." Somebody else might say, "I can't remember exactly, but I know it was some time in 1998, because I remember some other triggering event." Maybe somebody else remembers only that it was the late 90s, et cetera. All of that is information you're going to lose if all you have is a firm rule that whatever goes in this particular data field is an exact date. So be careful with that. A lot of times you can think about those scenarios up front and create rules for the data collector that leverage the information, even ambiguous information, rather than throwing it all away when you don't have exact values. Other times this comes into practice in studies, particularly behavioural studies, where individuals are asked questions that they may not want to answer.
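One way to keep the partial-date information, a hypothetical sketch rather than a standard, is to store the best-known value alongside a precision flag, so that "my son's birthday in 1998," "sometime in 1998," and "the late 90s" are all preserved rather than forced into a fake exact date or discarded:

```python
# A hypothetical structure for partial dates: the best-known value plus a
# precision flag, so ambiguous but real information is not thrown away.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartialDate:
    year: int
    month: Optional[int] = None   # None = unknown
    day: Optional[int] = None
    precision: str = "exact"      # "exact", "month", "year", "decade"

exact   = PartialDate(1998, 6, 14, precision="exact")   # knows the day
roughly = PartialDate(1998, precision="year")           # "sometime in 1998"
decade  = PartialDate(1995, precision="decade")         # "late 90s"

print(roughly.precision, roughly.year)  # year 1998
```

The analysis can then decide, per question, which precision levels are usable, instead of the data collector silently deciding for it.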
An example here might be: when did you last drink alcohol? The individual might tell you that they just don't remember, or they might tell you that's a question they don't particularly want to answer. Again, you've got information there that is relevant and could be used later in the analysis, around the relevancy of the questions and that sort of thing, that you don't want to throw away, even though it's not exactly the answer you were looking for. So think about codes for those. A lot of times, if a patient doesn't remember, we will code in a value of 888 or something similar that's consistent for the study. And again, back to a previous slide, it's important that when you define these codes and rules, they go right back into the documentation in the code book, so that you don't have to remember them 3 years from now; you've got a good, solid plan around your data. For quantitative studies, and this varies a little depending on the type of study or research you're conducting, it's very important, I think, to minimize the number of open-ended, free-form text data-entry fields: a "type your answer here" versus choosing from a categorical list or being required to enter a number or a code. The reason is that it's very, very difficult to analyze those data if you don't have a consistent framework for entering and storing them. The worst case is that you'll collect a lot of information that's like notes in the margins of a paper: you'll never analyze it.
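Those study-specific missing-data codes need handling on the analysis side too. A sketch (the 888/999 values are illustrative, as in the lecture): the codes preserve *why* a value is missing, and they must be separated out before any numeric analysis so a sentinel like 888 can't pollute a mean:

```python
# Study-specific missing-data codes (888/999 are illustrative). They
# preserve WHY a value is missing, and are stripped out before any
# numeric analysis so they cannot distort summary statistics.
MISSING_CODES = {888: "does not remember", 999: "declined to answer"}

def clean(values):
    """Split raw entries into analyzable numbers and missing-reason counts."""
    usable, reasons = [], {}
    for v in values:
        if v in MISSING_CODES:
            reason = MISSING_CODES[v]
            reasons[reason] = reasons.get(reason, 0) + 1
        else:
            usable.append(v)
    return usable, reasons

usable, reasons = clean([12, 888, 7, 999, 999, 30])
print(usable)   # [12, 7, 30]
print(reasons)  # {'does not remember': 1, 'declined to answer': 2}
```

And per the earlier point: whatever codes a study adopts, they belong in the code book, not in someone's memory.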
The best case is that you're going to need to create some sort of system where a domain expert takes the information that's been handwritten or typed in as free-form text, goes through it, and tries to pull out the concepts and code them after the fact. That happens sometimes, and particularly with certain types of qualitative studies it's about the best you can do, and it's a good design for those studies. But for the most part, whenever possible, minimize open-ended questions and try to stay with categorical fields or structured data types: numbers, defined codes, et cetera. This gets time-consuming, but again, we go back to that first principle: putting a lot of time into the study up front means you're actually going to save time later in your analysis, because the data are going to be much more useful. So it's a good idea to think through, from the time a patient hits the door for a particular study, or, in a different type of study, from the time we identify a cohort of individuals, exactly how things will go: we're going to survey them this way, we're going to bring them into the center, we're going to consent them here, we're going to go through these processes, we're going to take the measurement with this machine. Be very structured in thinking about the different procedures you're going to perform, the equipment you might use, and the actual exercises your staff will go through when they're working through the procedures of collecting data for individual patients. Here, we always recommend that you don't just stop with the logistical pieces. Actually run practice cases, where you're entering data into the case report forms that you have designed.
Once you've got practice data entered into the test case report forms, or surveys if you're doing patient-reported outcome research, have your statistician take the data coming through those forms and use it as test cases. In the vein of reproducible research, they can go ahead and start creating scripts, so that they're ready to analyze the real data, because they know exactly what's going to be coming in, and in exactly the right structure. Again, this takes some time, but in the long run, what we typically find is that the best research teams do this. It actually decreases time overall: if they've built in time for in-service training, if they've performed at least one test, if they've gone full cycle on a small set of data, either from test subjects or from a small sample of subjects, to get things rolling and get all of the kinks worked out, then it saves a lot of time, and the integrity of the study is much, much stronger. It also saves time if you've got staff turnover during the middle of the study: if everything's documented well, and you've got structured information and notebooks around the different procedures for the study, there's a lot less training involved as you bring in new staff. It's a good idea, an essential idea, to be very, very careful when dealing with confidential data. Really, I like to think about all information that we're collecting as confidential, but particularly if you're collecting patient identifiers, you want to create appropriate security safeguards for accessing the study data, and for sequestering the confidential information, the patient-identifier information.
Only individuals who need to see that information should see it, use it, and leverage it. Never email sensitive information, and always review a data set for identifiers before sending it in any fashion. In the United States, at least, we adhere to the HIPAA policy and rules when dealing with data sets. The HIPAA policy has a nice set of 18 identifiers that can be consulted when you're thinking about what the confidential information is that must be protected. But I always train my own staff, when working with research teams and consulting, to think about the HIPAA identifiers as a starting point. There's a lot of research out there showing that even without identifying information, in the hands of very bright and talented individuals there's a lot of ability to re-identify people. So it's good practice to think about that up front, and always be diligent about how you're dealing with data, whether or not you feel it's identified or de-identified. For data sharing and exchange: at some point during your study you may want to work with other teams, and in the US at least, if you're funded by the National Institutes of Health, there is a mandate that if you are conducting research studies in excess of a certain amount, you must have a data sharing and exchange policy. So it's a good idea, again, to think about that a priori: how are you going to do that at the end of the study? A couple of things are absolutely critical if you're going to be sharing any data: the code book, or data dictionary, and the SOPs, the standard operating procedures. Both of those go back to the concepts we've already talked about as being good ideas in the prospective conduct of your study anyway.
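The "review a data set for identifiers before sending" step can be partially automated. This is a hedged sketch, not a substitute for a real de-identification review under HIPAA or your local rules; the field list here is deliberately short and illustrative, while the actual HIPAA Safe Harbor list has 18 identifier categories:

```python
# A pre-export check: scan a data set's column names against a
# (shortened, illustrative) list of direct-identifier fields. This is a
# starting point only, not a full de-identification review.
IDENTIFIER_FIELDS = {"name", "ssn", "mrn", "email", "phone", "address", "dob"}

def flag_identifiers(columns):
    """Return any columns that look like direct identifiers."""
    return sorted(c for c in columns if c.lower() in IDENTIFIER_FIELDS)

cols = ["subject_id", "DOB", "systolic_bp", "Email"]
flagged = flag_identifiers(cols)
if flagged:
    print("Do NOT send: review these columns first ->", flagged)
```

Remember the re-identification point above: passing a check like this does not make a data set safe, it only catches the obvious direct identifiers before a human review.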
So think hard about those two things, and others, when you're thinking about data sharing and the possibilities going forward. It's also good practice to think very hard about data integrity and validation, especially as you're collecting the information. If you're using software, it's a good idea to set things up so that if a field is supposed to hold a number, the software is smart enough to perform a field-type check, to at least verify that the entry is a number, or maybe a number that's supposed to be between 3 and 100. With good software you can typically set that up, and so you get some benefits from using electronic data capture over manual capture on paper, by configuring things properly in your software. It's also worth thinking about how data quality can be built into an electronic system, though not always, and different statisticians have different opinions about the way data quality should be monitored. One way is double data entry. And again, this is just one way; we'll probably talk in other modules later downstream about other approaches to data quality. But one way an electronic system can help is through a process called double data entry. The idea is that we can never guarantee with absolute certainty that an individual will not make a mistake when entering data, whether onto a piece of paper or into an electronic system. It's also the case that people who are very diligent are expensive, and you might have to go through multiple passes of checks by the same individual before even they are confident that they're correct.
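The field-type and range checks described above can be sketched in a few lines. Good electronic data capture software typically lets you configure rules like this per field; the function below is an illustrative stand-in for that configuration, using the "number between 3 and 100" example from the lecture:

```python
# A minimal sketch of entry-time field validation: type-check first,
# then range-check, and return a message the data collector can act on.
def validate_numeric(raw: str, low: float, high: float):
    """Validate a single field; return (ok, message)."""
    try:
        value = float(raw)
    except ValueError:
        return False, f"'{raw}' is not a number"
    if not (low <= value <= high):
        return False, f"{value} is outside the allowed range {low}-{high}"
    return True, "ok"

print(validate_numeric("42", 3, 100))   # (True, 'ok')
print(validate_numeric("250", 3, 100))  # rejected: out of range
print(validate_numeric("abc", 3, 100))  # rejected: not a number
```

Catching these errors at entry time, while the patient or the source document is still in front of the collector, is far cheaper than discovering them at analysis time.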
Another way we sometimes get around this is to build a system with two independent sets of case report forms for a study, and have two different individuals enter the exact same information. Then the software can check that there's agreement between the two on each patient, on each particular data element. The idea is that both will make mistakes, but it's very unlikely that they'll make the same mistake, and so the software can help catch errors and give extra confidence in the quality of the data. Again, that's called double data entry, and we will talk about it a little more later, in the data quality modules. It's also very important to keep audit trails of all data changes. Here we're talking about keeping a log of who added or changed the data and when, and a lot of times, particularly for clinical trials, what we need to know is why the data changed. So it's very important to keep an audit for every single piece of data you're collecting for a study: who entered it, when, and, if it was changed, perhaps why that data element was changed. This is important in good clinical practice for paper case report forms, and it's just as important in electronic research data capture software. Here's where, if I'm teaching an in-person class with people in the room with me, I typically stop and say: look, we started the process with some very easy concepts. Up there in the left-hand corner of this slide is the very easy concept of garbage in, garbage out, so when we do a study, think a lot about it on the front end, and put a lot of time into it.
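The double-data-entry check amounts to a field-by-field comparison of the two independent entries. A sketch (the record layout is illustrative): only the disagreements go back to a human for adjudication, on the reasoning that two entrants are unlikely to make the same mistake on the same field:

```python
# A sketch of the double-data-entry check: two independent entries of the
# same record are compared field by field, and only disagreements are
# flagged for review and adjudication.
def compare_entries(entry_a: dict, entry_b: dict):
    """Return {field: (value_a, value_b)} for every field that disagrees."""
    fields = entry_a.keys() | entry_b.keys()
    return {f: (entry_a.get(f), entry_b.get(f))
            for f in sorted(fields)
            if entry_a.get(f) != entry_b.get(f)}

first  = {"subject": "S001", "systolic": 138, "smoker": 1}
second = {"subject": "S001", "systolic": 183, "smoker": 1}  # transposed digits

print(compare_entries(first, second))  # {'systolic': (138, 183)}
```

In a real system, the adjudicated correction would then be written with an audit-trail record: who changed it, when, and why.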
And we ended up with double data entry and audit trails, if you're using electronic systems. This is where I usually stop and ask students, not exactly like the lady screaming there, but we do get some puzzled looks: that's really quite difficult to do. It is difficult, but again, it's science, and it's very important for scientific integrity, for protecting the patients, and really for protecting the integrity of the data that you're going to use to base assumptions, analyses, and conclusions on, which will then go on to impact future research. So it's difficult, but it's very, very important. This is also where I say: don't worry. We're going to cover some real practical, real-world approaches in the next modules that will take us through doing this in practice, with exercises and so forth, so that by going through the processes we'll talk about later on, in different modules, we'll cover these concepts and more. So, in summary, in this module what we were really trying to do was cover the basic concepts behind planning for the collection of research data. I hope the importance of thinking through things before starting data collection came through several times, along with the importance of really good, solid record keeping and documentation for ongoing and shared research.