### Data Preparation Latest Statistics

• 76% of data scientists say that data preparation is the worst part of their job, yet efficient, accurate business decisions can only be made with clean data.
• Data scientists and data analysts report that 80% of their time is spent doing data prep, rather than analysis. 
• The upper and lower fences are the values that lie above the 75th percentile and below the 25th percentile, respectively, by 1.5 times the difference between the two percentiles (the interquartile range).
• According to previous studies, missing values are divided into two categories, missing completely at random and not missing at random, depending on the type of missingness that occurred.
• Regression analysis uses simple residuals, which are adjusted by the predicted values, and standardized residuals against the observed values to detect outliers.
• According to the source, in 2012, advertising expenditures for this industry reached 237.88 million U.S. dollars. 
• His main reason was that 80% of the work in data analysis is preparing the data for analysis. 
• For example, the range within one standard deviation of the mean covers about 68% of the data in a Gaussian sample.
• So, if the mean is 50 and the standard deviation is 5, as in the test dataset above, then all data in the sample between 45 and 55 will account for about 68% of the data sample. 
• We can cover more of the data sample if we expand the range as follows: 1 standard deviation from the mean covers 68%, 2 standard deviations cover 95%, and 3 standard deviations cover 99.7%.
• A value that falls outside of 3 standard deviations is part of the distribution, but it is an unlikely or rare event at approximately 1 in 370 samples. 
• For smaller samples of data, perhaps a value of 2 standard deviations (95%) can be used, and for larger samples, perhaps a value of 4 standard deviations (99.9%). 
• The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box and whisker plot. 
• The 50th percentile is the middle value, or the average of the two middle values for an even number of examples. 
• If we had 10,000 samples, then the 50th percentile would be the average of the 5000th and 5001st values. 
• We refer to the percentiles as quartiles because the data is divided into four groups via the 25th, 50th and 75th values. 
• The IQR defines the middle 50% of the data, or the body of the data. 
• The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. 
• The IQR can then be calculated as the difference between the 75th and 25th percentiles. 
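• In code, the quartiles are obtained with NumPy's `percentile` function, for example `q25, q75 = percentile(data, 25), percentile(data, 75)`, where `data` stands for the sample array, and the interquartile range is then `iqr = q75 - q25`.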
• We can then calculate the cutoff for outliers as 1.5 times the IQR and subtract this cut off from the 25th percentile and add it to the 75th percentile to give the actual limits on the data. 
• The worked example that identifies outliers with the interquartile range imports `seed` and `randn` from `numpy.random` and `percentile` from `numpy`, seeds the random number generator, generates a Gaussian sample with a mean of 50, calculates `q25` and `q75` with `percentile`, takes `iqr = q75 - q25`, and prints the 25th percentile, the 75th percentile and the IQR, each to three decimal places.
• Running the example first prints the identified 25th and 75th percentiles and the calculated IQR.
• Example output: percentiles 25th=46.685, 75th=53.359, IQR=6.674; identified outliers: 81; non-outlier observations: 9919.
• The predictions are then evaluated by computing the mean absolute error with `mean_absolute_error` and printing it with `print('MAE %.3f' % mae)`.
• Within-cluster sum of squares by cluster: 46.74796 and 56.11445 (between_SS / total_SS = 47.5%).
• For instance, vary k from 1 to 10 clusters; for each k, calculate the total within-cluster sum of squares (WSS); then plot the curve of WSS against the number of clusters k.
• Compute the estimated gap statistic presented in eq. 9, then compute the standard deviation sd = √∑_b (log(W*_b) …
• between_SS / total_SS = 71.2%.
• As noted above, it’s a time-consuming process. The 80/20 rule is often applied to analytics applications, with about 80% of the work said to be devoted to collecting and preparing data and only 20% to analyzing it.
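
The standard deviation rule quoted in the list (68%, 95% and 99.7% within 1, 2 and 3 standard deviations) can be checked on a synthetic sample. The following is a minimal sketch using only the Python standard library rather than NumPy. The function names `coverage` and `std_cutoff_outliers`, the seed value, and the sample size are assumptions made for illustration, so the counts it prints will not match any figures quoted in the list.

```python
import random
import statistics


def coverage(data, n_std):
    """Fraction of values lying within n_std standard deviations of the mean."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= n_std * std)
    return inside / len(data)


def std_cutoff_outliers(data, n_std=3.0):
    """Split data into outliers and inliers using mean +/- n_std * std."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    cutoff = n_std * std
    lower, upper = mean - cutoff, mean + cutoff
    outliers = [x for x in data if x < lower or x > upper]
    inliers = [x for x in data if lower <= x <= upper]
    return lower, upper, outliers, inliers


# Assumed test data: Gaussian sample with mean 50 and standard deviation 5.
rng = random.Random(1)  # illustrative seed
sample = [rng.gauss(50, 5) for _ in range(10_000)]

for n_std in (1, 2, 3):
    print("within %d std: %.1f%%" % (n_std, 100 * coverage(sample, n_std)))

lower, upper, outliers, inliers = std_cutoff_outliers(sample, n_std=3.0)
print("limits: %.3f to %.3f" % (lower, upper))
print("identified outliers: %d" % len(outliers))
print("non-outlier observations: %d" % len(inliers))
```

Changing `n_std` to 2 or 4 corresponds to the suggestion in the list for smaller or larger samples.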

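The interquartile range procedure described in the list (take the 25th and 75th percentiles, compute the IQR, and place limits a factor k of the IQR beyond each quartile) could look roughly like this. The sketch uses `statistics.quantiles` from the standard library in place of NumPy's `percentile`. The helper name `iqr_fences`, the seed, and the sample parameters are assumptions, so the printed numbers will differ from the example output quoted in the list.

```python
import random
import statistics


def iqr_fences(data, k=1.5):
    """Return (q25, q75, iqr, lower, upper) with limits k * IQR beyond the quartiles."""
    # n=4 yields the three quartile cut points; "inclusive" interpolates
    # between the observed values of the sample.
    q25, _median, q75 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q75 - q25
    cutoff = k * iqr
    return q25, q75, iqr, q25 - cutoff, q75 + cutoff


# Assumed test data: Gaussian sample with mean 50 and standard deviation 5.
rng = random.Random(1)  # illustrative seed
data = [rng.gauss(50, 5) for _ in range(10_000)]

q25, q75, iqr, lower, upper = iqr_fences(data, k=1.5)
outliers = [x for x in data if x < lower or x > upper]
kept = [x for x in data if lower <= x <= upper]

print("Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f" % (q25, q75, iqr))
print("Identified outliers: %d" % len(outliers))
print("Non-outlier observations: %d" % len(kept))
```

A larger `k` widens the limits and flags fewer values; `k=1.5` matches the cutoff named in the list.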

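The elbow procedure outlined in the list (vary k from 1 to 10, compute the total within-cluster sum of squares for each k, then inspect the curve) can be sketched as follows. This is a plain Lloyd's k-means with random initialisation and a few restarts, written with the standard library only; it is not the R `kmeans` routine whose output is quoted in the list. The helper names, the three-blob test data, and the restart count are all assumptions for illustration, and the plot is replaced by a printed table.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, rng, n_iter=50):
    """Run one basic Lloyd's k-means and return the total within-cluster sum of squares."""
    centers = rng.sample(points, k)  # random initial centers (assumption)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def best_wss(points, k, rng, n_init=5):
    """Best (lowest) WSS over several random restarts."""
    return min(kmeans_wss(points, k, rng) for _ in range(n_init))


# Assumed test data: three 2-D Gaussian blobs of 50 points each.
rng = random.Random(1)  # illustrative seed
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in blob_centers
    for _ in range(50)
]

# Total within-cluster sum of squares for k = 1..10.
wss = {k: best_wss(points, k, rng) for k in range(1, 11)}
for k, value in wss.items():
    print("k=%2d  total WSS=%.2f" % (k, value))
```

On data like this, the curve is expected to drop steeply up to the true number of groups and flatten afterwards, which is the "elbow" the method looks for.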