[Virtual Presenter] Today we will be discussing data preparation and analysis. I would like to take a moment to acknowledge the hard work of Dr. Brown and the many talented teaching assistants who have edited these slides over the years. Their efforts and contributions are highly valued. Now, let us proceed to the topic at hand.
[Audio] Today we are going to discuss data analysis. Before we dive in, it is important to consider a few key issues. Then we need to prepare our data for analysis. Finally, we will look at the data analysis itself. But first, I would like to extend special thanks to Dr. Brown and the many teaching assistants who have so diligently edited the data preparation and analysis slides over the years. Their hard work is truly appreciated.
Issues to consider before analysis: “Statistics are no substitute for judgment.” - Henry Clay.
[Audio] It is essential to understand the context of the study before analyzing the data. We must consider the objectives of the study, the research questions or hypotheses posed, the type of design applied, and the unit of analysis. These components must be considered to ensure the data are interpreted and analyzed correctly.
Issues to Consider Before Analysis: eligibility criteria; sample design; sample size and response rate. Do you have adequate power to detect a difference?
[Audio] Before beginning to analyze data, it is essential to ask important questions about your research question, your hypotheses, and the effect of interest that you are going to test. Having a conceptual framework is key to ensuring the accuracy of the results.
[Audio] The psychometric properties of your measurement instruments must be evaluated: content validity (how well the items represent the construct), criterion validity (correlation with a gold standard, sensitivity, specificity, positive predictive value, and negative predictive value), and construct validity (theoretical grounding, often assessed with principal component analysis and factor analysis).
[Audio] In order to carry out an analysis, identifying the independent and dependent variables is essential. Classifying the levels of measurement as nominal, ordinal, interval, or ratio is equally important: knowing the level of measurement for each key variable determines which analyses are appropriate.
[Audio] We're going to look at levels of measurement in data preparation and analysis. This slide shows four types: nominal, ordinal, interval, and ratio. Nominal is for attributes with names but no inherent order. Ordinal is for rank-ordered data, where the differences between rankings are not meaningful. With interval measurement, differences between values are meaningful, but ratios cannot be interpreted. Ratio is the only level at which both differences and ratios can be interpreted.
[Audio] Now I will show you a measurement matrix, a table linking survey questions to research objectives and levels of measurement. This matrix is very important in designing the survey questions and collecting data that will help us reach our objectives. For example, survey question 1 relates to research objective #1, is classified as an ordinal level of measurement, and asks about satisfaction with patient-provider communication. Survey question 2 relates to research objective #2, is also ordinal, and inquires about health status. Lastly, survey question 55 is a nominal level of measurement, relates to research objectives #1, #2, and #4, and asks about the demographics of the participant.
[Audio] Data preparation is an essential step if data analysis is to be precise and effective. It involves transforming raw data into a form that is more suitable for analytical purposes. This encompasses preparing the data for entry, entering the data, carrying out preliminary analysis, cleaning the data, and evaluating any existing data.
[Audio] It is essential to define all variables before data entry. This includes creating a data dictionary and codebook that documents each variable: variable names and labels, data type, numeric and/or decimal values, valid values, valid ranges, and missing value codes. These steps prepare you for data entry and help guarantee that the data are of high quality.
[Audio] Preparing for data entry is an essential component of data analysis. Assigning codes to responses groups the data into categories for simpler analysis, and a range of coding schemes is available. For example, a Sex variable might assign 0 to male and 1 to female, and a Health Status variable might assign 1 to Poor, 2 to Fair, 3 to Good, and 4 to Excellent.
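To make the codebook and coding scheme concrete, here is a minimal Python sketch; the variable names (sex, health_status) and the missing-value code of 9 are hypothetical, since the slides do not prescribe an implementation.

```python
import pandas as pd

# Hypothetical codebook entries mirroring the slide's coding scheme.
codebook = {
    "sex": {"label": "Participant sex", "type": "nominal",
            "codes": {0: "Male", 1: "Female"}, "missing": [9]},
    "health_status": {"label": "Self-rated health", "type": "ordinal",
                      "codes": {1: "Poor", 2: "Fair", 3: "Good", 4: "Excellent"},
                      "missing": [9]},
}

# Raw responses as they might arrive from data entry (9 = missing code).
df = pd.DataFrame({"sex": [0, 1, 9, 1], "health_status": [4, 2, 3, 9]})

# Replace declared missing-value codes with NA, then attach value labels.
for var, meta in codebook.items():
    df[var] = df[var].replace(meta["missing"], pd.NA)
    df[var + "_label"] = df[var].map(meta["codes"])

print(df)
```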
[Audio] Continuing with data preparation, we look at preparing for data entry and the scoring of questionnaires. To score a questionnaire, one may need to assign values or scores to each response so that items measuring the same construct can be combined into a composite measure such as an index or a scale. Caution must be taken to apply these scores and measures correctly to ensure accurate results.
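As an illustration of composite scoring, here is a small sketch assuming a hypothetical 3-item scale scored 1-4 with one reverse-coded item; the item names and scoring rules are invented for the example.

```python
import pandas as pd

# Hypothetical 3-item scale, each item scored 1-4; item_3 is worded
# negatively, so it is reverse-coded before summing.
df = pd.DataFrame({
    "item_1": [4, 2, 3],
    "item_2": [3, 2, 4],
    "item_3": [1, 3, 2],  # negatively worded item
})

df["item_3_rev"] = 5 - df["item_3"]          # reverse-code: 1<->4, 2<->3
items = ["item_1", "item_2", "item_3_rev"]
df["scale_score"] = df[items].sum(axis=1)    # composite index (range 3-12)
print(df[["scale_score"]])
```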
[Audio] This slide summarizes the data that have been gathered: the variables, their levels of measurement, scoring parameters, how the data were handled, and how long the data have been available. Notice that a diversity of data points has been collected, for example age, apathy, BMI, and cognition. These data are indispensable for the analysis and for obtaining valuable information.
[Audio] Data entry is an essential part of the data preparation process. Common methods include manual entry, optical character recognition, computer-assisted data collection, and direct transfer of the data into statistical software.
[Audio] Data cleaning is a critical part of survey preparation and analysis. It requires reviewing and correcting raw survey data to guarantee accuracy, conformity to the purpose of the questions, uniformity of entry, and completeness. The data are also arranged in a way that makes coding and tabulation simpler and faster.
[Audio] Editing helps guarantee the precision of our data. This includes range edits to confirm that subjects' ages fall within the indicated range, ratio edits to make sure figures are reasonable relative to one another, and comparison with historical data to check for continuity across study sessions. Let us now switch to our next slide.
[Audio] Editing data for accuracy is essential. Balance edits check that components sum correctly; for example, time spent across activities should total 100%. To detect outliers, the highest and lowest values should be checked to see whether any fall 3 or more standard deviations from the mean. Cross tabulations are a helpful way to check consistency, for example, to verify that a 10-year-old isn't reporting having a college degree.
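The edits described above can be expressed as simple checks in pandas. This sketch uses hypothetical column names (age, pct_work, pct_leisure, education) and an assumed eligible age range of 18-90.

```python
import pandas as pd

# Hypothetical survey records; column names are illustrative.
df = pd.DataFrame({
    "age": [34, 10, 212, 45],
    "pct_work": [60, 50, 40, 30],
    "pct_leisure": [40, 50, 50, 70],
    "education": ["HS", "College", "HS", "College"],
})

# Range edit: flag ages outside the study's eligible range (assumed 18-90).
bad_range = df[(df["age"] < 18) | (df["age"] > 90)]
print(bad_range)

# Balance edit: time-use percentages should total 100.
bad_balance = df[df["pct_work"] + df["pct_leisure"] != 100]
print(bad_balance)

# Outlier check: values 3 or more standard deviations from the mean.
z = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[z.abs() >= 3])

# Consistency check: e.g., a 10-year-old reporting a college degree.
print(pd.crosstab(df["age"] < 18, df["education"]))
```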
[Audio] It is essential to be careful when wording a question. As shown on this slide, the respondent ticked off two distinct categories in response to a single question! This demonstrates how important it is to craft questions that can only be answered with a single response, to guarantee the reliability of your data.
[Audio] Accuracy and consistency are of utmost importance in data preparation and analysis. When a participant gives conflicting responses, we must verify which response is the most accurate for that participant. This part of data cleaning may require extra effort, such as further research, call-backs, or centralized and field editing.
[Audio] Selecting the most appropriate statistical test for our data requires an awareness of the data's makeup, the sample size, and the purpose of the analysis.
[Audio] Choosing a test also depends on the nature of the data, the number and type of variables involved in the specific question, and whether the data carry rank-order, or ordinal, information.
[Audio] It also matters whether the comparison involves two groups, whether each sample follows a normal or a non-normal distribution, and whether the observations are independent or dependent.
[Audio] We can see from this slide that the data we have collected are meant to represent the entire population. The next step is to ask some pertinent questions: do the group means differ? Is there an association between the variables? What are the strength and direction of that association?
[Audio] Preliminary analysis is essential to understanding the data. We must check the total number of participants, review any outliers and values beyond the valid range, and assess the shape and peakedness of the distribution. We can also examine measures of central tendency to explore features of the sample such as demographics. This can even help us locate data entry mistakes.
[Audio] Statistics play an important role in data preparation and analysis. There are two primary types: descriptive and inferential. Descriptive statistics summarize the characteristics of the study subjects, including percentages, means, medians, the range, and the standard deviation. Inferential statistics tell us how likely it is that we would have observed similar results due to chance alone, quantify associations, and test hypotheses about the population based on sample data.
[Audio] This slide provides an overview of the nonparametric bivariate statistical tests available for assessing relationships between variables. Depending on whether the variables are nominal, ordinal, or mixed, the appropriate test can be chosen. Also shown are measures of strength of association, such as the phi coefficient, Yule's Q, the coefficient of contingency, Cramér's V, lambda, the odds ratio, Goodman and Kruskal's gamma, Kendall's tau-a, tau-b, and tau-c, Somers' d, the Spearman rank-order coefficient, and the uncertainty coefficient. These tests and measures of association are invaluable in gaining insight into the data's structure.
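As a worked example of one of these measures, here is a sketch computing a chi-square test and Cramér's V with scipy; the 2x3 table of counts is invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts, e.g. sex by self-rated health.
table = np.array([[20, 30, 10],
                  [15, 25, 30]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: strength of association for an r x c table.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, p={p:.4f}, Cramér's V={v:.2f}")
```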
[Audio] Using univariate statistics, we can understand the characteristics of a sample. We can look at the frequency of each characteristic to identify the most common occurrence, and calculate central tendency via the mode, median, and mean. We can also analyze the dispersion of the data through the range, variance, and standard deviation. The table shows the univariate statistics that can be used to describe the sample and what each one measures.
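These univariate statistics map directly onto pandas one-liners; the age values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample of ages.
age = pd.Series([23, 25, 25, 30, 31, 35, 40])

print("mode:  ", age.mode().tolist())    # most frequent value(s)
print("median:", age.median())           # middle value
print("mean:  ", age.mean())             # arithmetic average
print("range: ", age.max() - age.min())  # dispersion: max minus min
print("var:   ", age.var())              # sample variance
print("sd:    ", age.std())              # sample standard deviation
```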
[Audio] This slide focuses on parametric bivariate statistics, tests assessing the association between two variables. These tests let us examine how one variable can influence, or be affected by, another. Examples include simple bivariate regression, t-tests of differences in means between two groups, one-way analysis of variance, and paired t-tests of differences. Various measures can also be used to quantify the strength of the association, including the Pearson correlation coefficient, the biserial correlation, the point-biserial correlation, and the eta coefficient.
[Audio] We will start with univariate analysis of our research data, exploring all applicable variables, classifying them, and assessing their normality; we may also need to transform one or more of them. Next, we will conduct bivariate analysis, looking at the association between exposure and outcome variables, and at the associations of covariates with outcomes and exposures, with the aim of recognizing possible confounding factors.
[Audio] Stratified analysis can help identify potential effect modifiers in our analysis plan. Regression analysis, survival analysis, and additional methods can also be applied to further inform our results.
[Audio] We will begin with z-tests, which compare means or proportions. After that, we will cover the Chi-square test, which evaluates the association between categorical variables, and then relative risk, the prevalence ratio, and odds ratios, which measure the strength of the association between two categorical variables. Next come t-tests, which compare means between two groups or test whether one group's mean differs from a fixed value, and the Mann-Whitney U test, used when the data are not normally distributed. We will also discuss ANOVA, which compares the means of more than two groups, and the Kruskal-Wallis test, its counterpart for non-normal distributions. Lastly, we will review correlation, which measures the association between two continuous variables.
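Most of the tests just listed are available in scipy.stats; this sketch runs each on simulated data, so the groups and effect sizes are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(50, 10, 40)   # hypothetical group A scores
b = rng.normal(55, 10, 40)   # hypothetical group B scores
c = rng.normal(60, 10, 40)   # hypothetical group C scores

print(stats.ttest_ind(a, b))     # independent-samples t-test
print(stats.mannwhitneyu(a, b))  # non-normal alternative for two groups
print(stats.f_oneway(a, b, c))   # one-way ANOVA for 3+ groups
print(stats.kruskal(a, b, c))    # non-normal alternative for 3+ groups
print(stats.pearsonr(a, b))      # correlation of two continuous variables
```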
[Audio] Regression is a powerful tool for predictive analytics. Four different types of regression are shown: linear, logistic, Poisson, and survival analysis. Linear regression is employed when the dependent variable is continuous. Logistic regression is used when the dependent variable is binary. Poisson regression models count or rate outcomes. Lastly, survival analysis models the time until an event occurs. All of these models give useful insights into the structure of a dataset.
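Three of the four regression types can be sketched with statsmodels on simulated data; survival analysis usually requires a dedicated package such as lifelines, so it is omitted here, and all coefficients and data are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = sm.add_constant(x)  # add intercept term

# Linear regression: continuous outcome.
y_cont = 2 + 3 * x + rng.normal(size=100)
print(sm.OLS(y_cont, X).fit().params)

# Logistic regression: binary outcome.
y_bin = (rng.random(100) < 1 / (1 + np.exp(-x))).astype(int)
print(sm.Logit(y_bin, X).fit(disp=0).params)

# Poisson regression: count outcome.
y_count = rng.poisson(np.exp(0.5 + 0.3 * x))
print(sm.Poisson(y_count, X).fit(disp=0).params)
```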
[Audio] Statistical tests for comparing independent samples will be discussed next. Two commonly used tests for comparing two unpaired data sets are the independent z-test and the unpaired, or independent-samples, t-test. Further tests covered include the Wilcoxon rank-sum test, the Chi-square test, Fisher's exact test, linear or logistic regression, and survival analysis.
[Audio] We have discussed the analysis tests that can be used for pre-post designs, or designs that collect data on the same individual more than once. These include the paired t-test, the Wilcoxon signed-rank test, McNemar's test, repeated-measures ANOVA, and repeated-measures regression.
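A brief sketch of the paired tests on simulated pre/post scores; the McNemar table of paired binary outcomes is invented for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(2)
pre = rng.normal(100, 15, 30)        # hypothetical pre-test scores
post = pre + rng.normal(5, 10, 30)   # same subjects after intervention

print(stats.ttest_rel(pre, post))    # paired t-test
print(stats.wilcoxon(pre, post))     # Wilcoxon signed-rank (non-normal)

# McNemar's test for paired binary outcomes (2x2 table of counts).
table = [[20, 5],
         [15, 10]]
print(mcnemar(table, exact=True))
```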
[Audio] I would like to express my deep gratitude to Dr. Brown and the many teaching assistants who have dedicated an immense amount of time to creating the detailed data preparation and analysis slides for our course. We would not have access to such excellent and useful resources without them, and they have worked hard to ensure that our students have the most beneficial learning experience. A special thank you to Dr. Brown and the TAs for helping us in so many ways.
[Audio] It is important to note that survey sample design involves both probability and non-probability samples. Examples of probability samples include simple random, systematic, and complex designs such as stratified, cluster, and mixed designs. Non-probability samples are also used.
[Audio] Sample weights adjust for unequal selection probabilities. They include post-stratification weights, which adjust the sample proportions of demographic subgroups to match the population proportions, and non-response weights, which increase the weights of respondents to compensate for non-respondents who share similar characteristics.
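Post-stratification weighting reduces to a ratio of population share to sample share. This sketch assumes a hypothetical sample that over-represents women and an assumed 50/50 population split.

```python
import pandas as pd

# Hypothetical sample, over-representing women relative to the population.
sample = pd.DataFrame({"sex": ["F"] * 70 + ["M"] * 30})

# Known population proportions (assumed 50/50 here).
pop_share = {"F": 0.5, "M": 0.5}

# Post-stratification weight = population share / sample share.
sample_share = sample["sex"].value_counts(normalize=True)
sample["weight"] = sample["sex"].map(lambda s: pop_share[s] / sample_share[s])

# Weighted proportions now match the population.
print(sample.groupby("sex")["weight"].sum() / sample["weight"].sum())
```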
[Audio] When working with survey data, researchers often confront missing data. The reasons range from refusal to answer, not-applicable responses, unintelligible responses, and 'don't know' answers, to attrition or loss to follow-up, data processing errors, questionnaire programming errors, and changes to the instrument or design elements such as skip patterns. It is therefore essential to consider the causes of missing data when analyzing the findings of a study.
[Audio] Nonresponse can be divided into two categories: unit nonresponse and item nonresponse. Unit nonresponse happens when the respondent fails to provide any answers at all, while item nonresponse occurs when the respondent participates partially, leaving out certain items. Distinguishing these two types of nonresponse is essential for evaluating the accuracy and integrity of the data.
[Audio] Missing data is an important issue that needs to be addressed if we want to analyze our data accurately. When missing data are ignored or handled inappropriately, they can lead to biased estimates of descriptors and associations, incorrect standard errors, and incorrect inference.
[Audio] We take a look at missing data in this slide. Handling missing data is a two-step process: first explore the pattern of missing data, then select a missing data technique. Missing data mechanisms fall into three categories: Missing Completely At Random, Missing At Random, and Missing Not At Random; the first two are generally treated as ignorable, while the third is not. We will now look at each of these categories in more detail.
[Audio] The topic of this slide is Missing Completely At Random, or MCAR. Here, missingness is independent of every variable, meaning cases with complete data are not systematically different from cases with missing data; the missing cases are a random subset of the original sample. As an example, consider someone declining to provide smoking information on a questionnaire purely at random: nothing determines which values will be missing except chance.
[Audio] Next is Missing At Random, or MAR: the probability that a value is missing depends on other observed variables, but not on the missing value itself. The slide gives two examples: whether a grade is missing may depend on year in program, and whether weight lost is recorded may depend on the number of weeks spent following the diet.
[Audio] The last type is Missing Not At Random, or MNAR. In contrast to the other mechanisms, MNAR means the probability of missing data depends on the true, unobserved value of the variable. For example, when a survey asks about smoking status, current smokers may be less likely to report it, so the pattern of missing values is related to the values the variable would have taken. It is important to watch for MNAR when conducting data analysis.
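To make the three mechanisms concrete, this sketch simulates MCAR, MAR, and MNAR missingness on hypothetical diet data; the variable names and probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "weeks_on_diet": rng.integers(1, 20, n),
    "weight_lost": rng.normal(5, 2, n),
})

# MCAR: missingness is pure chance, unrelated to any variable.
mcar = rng.random(n) < 0.1

# MAR: missingness depends on an observed variable (weeks on diet).
mar = rng.random(n) < (df["weeks_on_diet"] / 40)

# MNAR: missingness depends on the unobserved value itself
# (people who lost little weight are less likely to report it).
mnar = rng.random(n) < 1 / (1 + np.exp(df["weight_lost"] - 3))

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name} share missing: {np.mean(mask):.2f}")
```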
[Audio] When faced with missing or incomplete data, it helps to know the basic deletion strategies: list-wise deletion, pair-wise deletion, and dummy variable adjustment. List-wise deletion drops an entire record when any variable has a missing value. Pair-wise deletion makes use of all the data available for each particular pair of variables. Dummy variable adjustment uses an indicator variable to flag and adjust for the missing data.
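Here is how the three deletion strategies look in pandas on a toy data frame; column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 45],
    "income": [50, np.nan, 60, 70],
})

# List-wise deletion: drop any row with a missing value.
complete_cases = df.dropna()
print(complete_cases)

# Pair-wise deletion: each statistic uses all data available for it
# (pandas correlations use pairwise-complete observations by default).
print(df.corr())

# Dummy variable adjustment: missingness indicator plus a filled value.
df["income_missing"] = df["income"].isna().astype(int)
df["income_filled"] = df["income"].fillna(df["income"].mean())
print(df)
```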
[Audio] Various imputation techniques are used to manage missing data, such as cold deck imputation, hot deck imputation, random imputation within classes, and simple hot deck imputation. Cold deck imputation substitutes values from an external source, such as an earlier survey or historical data. Hot deck imputation substitutes actual values from other cases in the current dataset: random imputation within classes selects a donor case with characteristics similar to the missing case and uses that case's value, while simple hot deck imputation uses one randomly selected case to substitute for the missing data. Each approach comes with advantages and disadvantages, and it is important to be aware of them.
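A minimal sketch of random hot deck imputation within classes, assuming a hypothetical age_group class variable; donors are drawn at random from complete cases in the same class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "age_group": ["young", "young", "young", "old", "old", "old"],
    "income": [40, 42, np.nan, 65, np.nan, 70],
})

# Random hot deck within classes: for each missing value, borrow the
# value of a randomly chosen donor from the same age group.
def hot_deck(group):
    donors = group.dropna()
    return group.apply(lambda v: rng.choice(donors.values) if pd.isna(v) else v)

df["income_imputed"] = df.groupby("age_group")["income"].transform(hot_deck)
print(df)
```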
[Audio] We will also discuss regression methods, which predict missing values from regression equations using other variables as predictors, and maximum likelihood methods, which incorporate all available data when computing likelihood-based statistics. Lastly, we'll examine multiple imputation, which builds on maximum likelihood to create multiple data sets with imputed values for the incomplete cases, analyzes each, and combines the results.
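As one possible illustration of multiple imputation, this sketch uses scikit-learn's IterativeImputer with posterior sampling to create several completed data sets on simulated data; a full analysis would pool both point estimates and their variances using Rubin's rules.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[rng.random((100, 3)) < 0.1] = np.nan  # introduce 10% missing values

# Multiple imputation: create several completed data sets by varying
# the imputer's random state, analyze each, then combine the results.
estimates = []
for m in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    X_complete = imp.fit_transform(X)
    estimates.append(X_complete[:, 0].mean())  # analysis step per data set

# Combining step (point estimates only; Rubin's rules also pool variances).
print("pooled estimate:", np.mean(estimates))
```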
[Audio] Our first piece of advice on missing data is to avoid having missing data altogether. If you do find yourself with missing data, seek out the help of a biostatistician. For your final assignment, we recommend that you decide on either exclusion or imputation, stating your chosen method and rationale. We would like to give a special and heartfelt thank you to Dr. Brown and all the TAs who have worked with us to prepare and analyze the data for this presentation, and we appreciate everyone staying with us. Thank you all for your attention.