You've learned how to define the data for your task: what the target label y should be and what the input x should be. But how do you actually go about obtaining data for your task? Let's take a look at some best practices.

One key question I would urge you to think about is how much time you should spend obtaining data. Machine learning is a highly iterative process: you pick a model and hyperparameters, assemble a dataset, train, carry out error analysis, and go around this loop multiple times to get to a good model. Say, for the sake of argument, that training your model for the first time takes a couple of days (maybe much shorter, maybe longer), and that carrying out error analysis for the first time also takes a couple of days. If that's the case, I would urge you not to spend 30 days collecting data, because that delays your entry into this iteration loop by a whole month. Instead, get into the loop as quickly as possible. Ask yourself: what if you gave yourself only two days to collect data? Would that get you into the loop much faster? Maybe two days is too short, but I've seen far too many teams take too long to collect their initial dataset before training an initial model, whereas I've rarely come across a team where I said, "Hey, you really should have spent more time collecting data." After you've trained your initial model and carried out error analysis, there's plenty of time to go back and collect more data.

For a lot of projects I've led, when I go to the team and say, "Hey everyone, we're going to spend at most seven days collecting data, so what can we do?" I've found that posing the question that way often leads to much more creative, scrappy ways to get a lot of data quickly, ways that still 100 percent respect user privacy and follow any applicable regulatory considerations. That lets you enter the iteration loop much sooner and helps the project make faster progress.

One exception to this guideline: if you have worked on this problem before and know from experience that you need at least a certain training set size, then it may be worth investing more effort up front to collect that much data. Because I've worked on speech recognition, I have a good sense of how much data I need to do certain things, and I know it's just not worth trying to train certain models with less than a certain number of hours of audio. But if you're working on a brand-new problem and you're not sure how much data is needed (it's often hard to tell even from the literature), it's much better to quickly collect a small amount of data, train a model, and then use error analysis to tell you whether it's worth your while to collect more.
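To make the shape of this workflow concrete, here is a minimal sketch in Python. Every function in it is a hypothetical placeholder I'm inventing for illustration, not a real API; the point is only the ordering: train on a small, quickly collected dataset first, and let error analysis decide whether collecting more data is worthwhile.

```python
# Minimal sketch of the iteration loop: train quickly on a small dataset,
# run error analysis, and only then decide whether more data is worth it.
# Every name here is a hypothetical placeholder, not a real API.

def train_model(data):
    # Stand-in for choosing a model/hyperparameters and training.
    return {"trained_on": len(data)}

def error_analysis(model, data):
    # Stand-in for tagging dev-set errors and deciding what to do next.
    return {"good_enough": len(data) >= 4000, "needs_more_data": True}

def collect_data(n_examples):
    # Stand-in for whichever data source your inventory picks.
    return [f"example_{i}" for i in range(n_examples)]

data = collect_data(1000)  # small initial dataset, gathered in days not weeks
for iteration in range(10):
    model = train_model(data)
    findings = error_analysis(model, data)
    if findings["good_enough"]:
        break
    if findings["needs_more_data"]:
        data += collect_data(len(data))  # grow gradually between iterations
print(f"stopped after {iteration + 1} iterations with {len(data)} examples")
```

The specific stopping rule and growth step here are arbitrary; what matters is that data collection happens inside the loop, guided by error analysis, rather than as one big up-front investment.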
In terms of getting the data you need, one other step I often carry out is to take inventory of possible data sources. Let's continue with speech recognition as the example. If you brainstorm a list of data sources, you might come up with something like this. Maybe you already own 100 hours of transcribed speech data; because you already own it, its cost is zero. Or you could use a crowdsourcing platform and pay people to read text: you provide a piece of text, ask them to read it aloud, and you get speech data for which you already have the transcript, because they were reading text you supplied. Or you could take audio you already have that hasn't been labeled yet and pay for it to be transcribed. This turns out to be more expensive per hour than paying people to read text, but it yields audio that sounds more natural, because people aren't reading; for 100 hours of data, it might cost $6,000 to get high-quality transcripts. Or you may find commercial organizations that can sell you data.

Through an exercise like this, you can brainstorm the different types of data you might use along with their associated costs. One column that's often missing, and that I find very important, is the time cost: how long will it take to actually get each type of data? The owned data you can use instantaneously. For crowdsourced reading, you may need to find the right crowdsourcing platform, implement a bunch of software, and carry out the integration, so you might estimate two weeks of engineering work. Paying for data to be transcribed is simpler, though there is still work to organize and manage it, whereas purchasing data may involve a purchase-order process that could be much quicker. I find that some teams won't go through an inventory process like this and will just pick an idea at random, say crowdsourced reading. But if you sit down, write out all the different data sources, and think through the trade-offs, including cost and time, you can often make better decisions about which sources to use. If you are especially pressed for time, this analysis might lead you to use the data you already own plus some purchased data, rather than the middle two options, in order to get going more quickly. Beyond the amount of data you can acquire, the financial cost, and the time cost, other important, application-dependent factors include data quality (you may decide, for example, that paying for transcription gives more natural audio than having people sound like they're reading) and, very importantly, privacy and regulatory constraints.
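One way to keep such an inventory honest is to actually write it down with both the dollar column and the time column. Here is a minimal sketch; apart from the $6,000 transcription figure mentioned above, every number is a made-up placeholder to illustrate the trade-offs, not a real quote.

```python
# Sketch of a data-source inventory for 100 hours of speech data, with both
# a dollar-cost and a time-cost column. Only the $6,000 transcription figure
# comes from the discussion above; all other values are made-up placeholders.

sources = [
    # (source,                  hours, dollars, time to acquire)
    ("owned transcribed data",    100,       0, "instant"),
    ("crowdsourced reading",      100,    3000, "~2 weeks of integration work"),
    ("paid transcription",        100,    6000, "days to organize and manage"),
    ("purchased from a vendor",   100,   10000, "length of the PO process"),
]

for name, hours, dollars, lead_time in sources:
    print(f"{name:<26} ${dollars / hours:>6.2f}/hour  lead time: {lead_time}")
```

A per-hour cost like this makes the options easy to compare side by side, but remember that the time column, the data quality, and the privacy and regulatory constraints can matter just as much as the dollar figure.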
If you decide to get data labeled, here are some options to think through as well. The three most common ways to get data labeled are in-house, where your own team labels the data; outsourced, where you find a company that labels data and have them do it for you; and crowdsourced, where you use a crowdsourcing platform to have a large group of people collectively label the data. The difference between outsourcing and crowdsourcing is that, depending on your type of data, there may be specialized companies that can get the labeling done quite efficiently. Among the trade-offs: having machine learning engineers label data is often expensive, but to get a project going quickly, having them do it for just a few days is usually fine, and in fact it helps build the engineers' intuition about the data. When I'm working on a new project, I often don't mind spending a few hours, or even a day or two, labeling data myself if that helps me build my intuition about the project. But beyond a certain point, you won't want to spend all your time as a machine learning engineer labeling data, and you'll want to shift to a more scalable labeling process.

Depending on your application, different groups of people will be more or less qualified to provide the labels y. If you're working on speech recognition, almost any reasonably fluent speaker can listen to audio and transcribe it, so, given the number of people who speak a given language, there is a very large pool of potential labelers (hopefully careful and diligent ones). For more specialized applications like factory inspection or medical image diagnosis, a typical person off the street can't look at a medical X-ray image and make a diagnosis, or look at a smartphone and determine what is and isn't a defect; tasks like these usually require an SME, a subject matter expert, to provide accurate labels. Finally, for some applications it is very difficult to get anyone to give good labels. Take product recommendations: there are probably recommendation systems giving you better recommendations than even your best friend or your significant other could. For those, you may just have to rely on the user's purchase data as the label rather than getting humans to label it. Figuring out which of these categories your application falls into, and identifying the right type of person or persons to help you label, is an important step toward making sure your labels are high quality.

One last tip. Say you have 1,000 examples and you've decided you need a bigger dataset. How much bigger should you make it? One tip I've given a lot of teams: don't increase your data by more than 10x at a time. If you've trained your model on 1,000 examples, it may be worth investing to grow your dataset to 3,000 examples, or at most 10,000. But I would do that 10x-or-less increase first, train another model, carry out error analysis, and only then decide whether it's worth growing substantially beyond that, because once you increase your dataset size by 10x, so many things change that it's really difficult to predict what will happen beyond that point. It's also fine to increase your dataset size by 10 percent, 50 percent, or just 2x at a time; 10x is only an upper bound on how much you might invest at once (there's a short sketch of this guideline at the end of this section). This guideline will hopefully help teams avoid over-investing in tons of data, only to realize that collecting quite that much data wasn't the most useful thing they could have done.

I hope the tips in this video help you be more efficient in how you go about collecting your data. Now, when you collect your data, one thing you might run into is the need to build a data pipeline, where your data doesn't come all at once but has to go through multiple processing steps. Let's go on to the next video to take a look at some best practices for building data pipelines.
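Finally, here is the promised sketch of the capped data-growth schedule from the last tip. The starting size and the 3x proposal are made-up illustrations; the point is only that no single round grows the dataset by more than 10x before another pass of training and error analysis.

```python
# Sketch of the "at most 10x at a time" guideline: each round of data
# collection is capped, and is followed by retraining and error analysis
# before committing to more. The sizes here are made-up illustrations.

dataset_size = 1000
MAX_GROWTH = 10    # upper bound per round; 1.1x, 1.5x, or 2x is fine too

for round_number in range(1, 4):
    proposed = dataset_size * 3                       # e.g., 1,000 -> 3,000
    next_size = min(proposed, dataset_size * MAX_GROWTH)
    print(f"round {round_number}: trained on {dataset_size:,}, "
          f"next collect up to {next_size:,} examples")
    # ...train a model on the larger dataset, carry out error analysis,
    # and only continue collecting if the analysis says it's worth it...
    dataset_size = next_size
```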