Here you will find further subjects in statistics that are not covered in the First Steps and Next Steps pages. If there are subjects missing that you would like to see included, let us know using the link at the bottom of the page.
For information on how to undertake a particular analysis using statistical software, see the relevant resource page. We are constantly working to improve and add to these resources. If you have any suggestions or feedback, let us know.
The program is designed to produce highly skilled, versatile statisticians and data scientists who possess powerful abilities for analyzing data. As such, SDS students learn not only how to build statistical models that generate predictions, but how to validate these models and interpret their parameters. Students learn to use their ingenuity to “wrangle” with complex data streams and construct informative data visualizations.
The major in statistical & data sciences consists of 10 courses, including depth in both statistics and computer science, an integrating course in data science, a course that emphasizes communication and an application domain of expertise. All but the application domain course must be graded; the application course can be taken S/U.
Advisers
Benjamin Baumer, Shiya Cao, Kaitlyn Cook, Randi Garcia, Albert Y. Kim, Katherine Kinnaird, Scott LaCombe, Lindsay Poirier. If you wish to declare an SDS major and need an adviser, please fill out the form at https://bit.ly/sds_advisor.
Study Abroad Adviser
Scott LaCombe
See the major diagram below for prerequisites, and see the Note on course substitutions following the description of the major.
Please consult our continuously-updated, nonexhaustive list of previously approved application domain courses, which includes:
A student and their adviser should identify potential application domains of interest as early as possible, since many suitable courses will have prerequisites. Normally, this should happen during the fourth semester or at the time of major declaration, whichever comes first. The determination of whether a course satisfies the requirement will be made by the student’s major adviser.
Notes on course substitutions:
The Major in Mathematical Statistics
Students interested in doctoral programs in Statistics should consider the Major in Mathematical Statistics jointly operated by SDS and MTH.
Access to a Windows PC and an approved statistics package is required for analysis of data.
Learning outcomes can change before the start of the semester you are studying the course in.
Assessment weightings can change up to the start of the semester the course is delivered in.
You may need to take more assessments depending on where, how, and when you choose to take this course.
In 2017, nearly 38,000 persons of working age (16–64 years) in the United States died by suicide, which represents a 40% rate increase (12.9 per 100,000 population in 2000 to 18.0 in 2017) in less than 2 decades.* To inform suicide prevention, CDC analyzed suicide data by industry and occupation among working-age decedents presumed to be employed at the time of death from the 32 states participating in the 2016 National Violent Death Reporting System (NVDRS).^{†,§} Compared with rates in the total study population, suicide rates were significantly higher in five major industry groups: 1) Mining, Quarrying, and Oil and Gas Extraction (males); 2) Construction (males); 3) Other Services (e.g., automotive repair) (males); 4) Agriculture, Forestry, Fishing, and Hunting (males); and 5) Transportation and Warehousing (males and females). Rates were also significantly higher in six major occupational groups: 1) Construction and Extraction (males and females); 2) Installation, Maintenance, and Repair (males); 3) Arts, Design, Entertainment, Sports, and Media (males); 4) Transportation and Material Moving (males and females); 5) Protective Service (females); and 6) Healthcare Support (females). Rates for detailed occupational groups (e.g., Electricians or Carpenters within the Construction and Extraction major group) are presented and provide insight into the differences in suicide rates within major occupational groups. CDC's Preventing Suicide: A Technical Package of Policy, Programs, and Practices^{[1]} contains strategies to prevent suicide and is a resource for communities, including workplace settings.
NVDRS combines data on violent deaths, including suicide, from death certificates, coroner/medical examiner reports, and law enforcement reports. Industry and occupation coding experts used CDC's National Institute for Occupational Safety and Health Industry and Occupation Computerized Coding System (NIOCCS 3.0)^{¶} to assign 2010 U.S. Census civilian industry and occupation codes for 20,975 suicide decedents aged 16–64 years from the 32 states participating in the 2016 NVDRS, using decedents' usual industry and occupation as reported on death certificates. Industry (the business activity of a person's employer or, if self-employed, their own business) and occupation (a person's job or the type of work they do) are distinct ways to categorize employment.^{[2]}
Suicide rates were analyzed for industry and occupational groups by sex. Population counts by occupation for rate denominators were states' civilian, noninstitutionalized current job population counts (for persons aged 16–64 years) from the 2016 American Community Survey Public Use Microdata Sample.** Replicate weight standard errors for those counts were used to calculate 95% confidence intervals (CIs) for suicide rates.^{[3]} Rates were calculated by U.S. Census code for major industry groups, major occupational groups, and detailed occupational groups with ≥20 decedents; detailed occupational groups are typically more homogenous in terms of employee income, work environment, and peer group. Rates were not calculated for detailed industry groups because many decedents' industry was classifiable only by major group. The following decedents were excluded from rate calculations: military workers (327); unpaid workers (2,863); those whose other NVDRS data sources (e.g., law enforcement reports) indicated no employment at time of death (i.e., unemployed, disabled, incarcerated, homemaker, or student)^{[4]} (1,783); and those not residing in the analysis states (223). A total of 15,779 decedents, including 12,505 (79%) males and 3,274 (21%) females, were included in the analysis. The analysis was conducted using Stata (version 15, StataCorp) and SAS (version 9.4, SAS Institute) statistical software.
Industry and occupational groups with suicide rates significantly (α = 0.05) higher than the study population (i.e., all industries or occupations: 27.4 males [95% CI = 26.9–27.9] and 7.7 females [95% CI = 7.5–8.0] per 100,000 population) were identified when the group's 95% CI exceeded the study population rate point estimate. Treating the population rate as a constant is reasonable when variance is small and is required for one-sample inference that recognizes the nonindependence of individual industry and occupation groups relative to the study population.
The five major industry groups with suicide rates higher than the study population by sex included 1) Mining, Quarrying, and Oil and Gas Extraction (males: 54.2 per 100,000 civilian noninstitutionalized working population, 95% CI = 44.0–64.3); 2) Construction (males: 45.3, 95% CI = 43.4–47.2); 3) Other Services (e.g., automotive repair; males: 39.1, 95% CI = 36.1–42.0); 4) Agriculture, Forestry, Fishing, and Hunting (males: 36.1, 95% CI = 31.7–40.5); and 5) Transportation and Warehousing (males: 29.8, 95% CI = 27.8–31.9; females: 10.1, 95% CI = 7.9–12.8) (Table 1) (Supplementary Table 1, https://stacks.cdc.gov/view/cdc/84274). The six major occupational groups with higher rates included 1) Construction and Extraction (males: 49.4, 95% CI = 47.2–51.6; females: 25.5, 95% CI = 15.7–39.4); 2) Installation, Maintenance, and Repair (males: 36.9, 95% CI = 34.6–39.3); 3) Arts, Design, Entertainment, Sports, and Media (males: 32.0, 95% CI = 28.2–35.8); 4) Transportation and Material Moving (males: 30.4, 95% CI = 28.8–32.0; females: 12.5, 95% CI = 10.2–14.7); 5) Protective Service (females: 14.0, 95% CI = 9.9–19.2); and 6) Healthcare Support (females: 10.6, 95% CI = 9.2–12.1).
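The decision rule described above (a group is flagged when its entire 95% CI lies above the study population's point estimate) can be sketched in a few lines of Python. This is only an illustration of the comparison, not the analysis code used for the report, which was run in Stata and SAS. The function name `exceeds_population_rate` and the variable names are hypothetical; the rates and intervals are the male major industry figures quoted in the text.

```python
# Study-population suicide rate for males, per 100,000 (point estimate from the text).
MALE_POPULATION_RATE = 27.4

# Male major industry groups: (rate, CI lower bound, CI upper bound), per 100,000.
male_industry_rates = {
    "Mining, Quarrying, and Oil and Gas Extraction": (54.2, 44.0, 64.3),
    "Construction": (45.3, 43.4, 47.2),
    "Other Services": (39.1, 36.1, 42.0),
    "Agriculture, Forestry, Fishing, and Hunting": (36.1, 31.7, 40.5),
    "Transportation and Warehousing": (29.8, 27.8, 31.9),
}


def exceeds_population_rate(ci_lower, population_rate):
    """Return True when the whole 95% CI lies above the population point estimate."""
    return ci_lower > population_rate


# Keep the groups whose CI lower bound is above the population rate.
flagged = [
    name
    for name, (_rate, ci_lower, _ci_upper) in male_industry_rates.items()
    if exceeds_population_rate(ci_lower, MALE_POPULATION_RATE)
]

for name in flagged:
    print(name)
```

Because the population rate is treated as a constant, only the lower bound of each group's interval matters for the "significantly higher" comparison.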
Rates could be calculated for 118 detailed occupational groups for males and 32 for females (Supplementary Table 2, https://stacks.cdc.gov/view/cdc/84275). Some occupational groups with suicide rates significantly higher than those of the study population were only identifiable through observation at the detailed group level (Table 2). Among males, these detailed groups included the following seven groups: 1) Fishing and hunting workers (part of the Farming, Fishing, and Forestry major occupational group); 2) Machinists (Production major group); 3) Welding, soldering, and brazing workers (Production major group); 4) Chefs and head cooks (Food Preparation and Serving Related major group); 5) Construction managers (Management major group); 6) Farmers, ranchers, and other agricultural managers (Management major group); and 7) Retail salespersons (Sales and Related major group). Among females, these detailed groups included the following five groups: 1) Artists and related workers (Arts, Design, Entertainment, Sports, and Media major group); 2) Personal care aides (Personal Care and Service major group); 3) Retail salespersons (Sales and Related major group); 4) Waiters and waitresses (Food Preparation and Serving Related major group); and 5) Registered nurses (Healthcare Practitioners and Technical major group). Groups with highest rate point estimates (e.g., female Artists and related workers and male Fishing and hunting workers) also had wide 95% CIs (Table 2), based on relatively low numbers of decedents and relatively small working populations (Supplementary Table 2, https://stacks.cdc.gov/view/cdc/84275).
One of Pearson's most significant achievements occurred in 1900, when he developed a statistical test called Pearson's chi-square (χ^{2}) test, also known as the chi-square test for goodness-of-fit (Pearson, 1900). Pearson's chi-square test is used to examine the role of chance in producing deviations between observed and expected values. The test depends on an extrinsic hypothesis, because it requires theoretical expected values to be calculated. The test indicates the probability that chance alone produced the deviation between the expected and the observed values (Pierce, 2005). When the probability calculated from Pearson's chi-square test is high, it is assumed that chance alone produced the difference. Conversely, when the probability is low, it is assumed that a significant factor other than chance produced the deviation.
In 1912, J. Arthur Harris applied Pearson's chi-square test to examine Mendelian ratios (Harris, 1912). It is important to note that when Gregor Mendel studied inheritance, he did not use statistics, and neither did Bateson, Saunders, Punnett, and Morgan during their experiments that discovered genetic linkage. Thus, until Pearson's statistical tests were applied to biological data, scientists judged the goodness of fit between theoretical and observed experimental results simply by inspecting the data and drawing conclusions (Harris, 1912). Although this method can work perfectly if one's data exactly matches one's predictions, scientific experiments often have variability associated with them, and this makes statistical tests very useful.
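The chi-square value is calculated using the following formula: χ^{2} = Σ [(O − E)^{2} / E], where O is the observed frequency and E is the expected frequency for each outcome category.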
Using this formula, the difference between the observed and expected frequencies is calculated for each experimental outcome category. The difference is then squared and divided by the expected frequency. Finally, the chi-square values for each outcome are summed together, as represented by the summation sign (Σ).
Pearson's chi-square test works well with genetic data as long as there are enough expected values in each group. In the case of small samples (fewer than 10 in any category) that have 1 degree of freedom, the test is not reliable. (Degrees of freedom, or df, will be explained in full later in this article.) However, in such cases, the test can be corrected by using the Yates correction for continuity, which reduces the absolute value of each difference between observed and expected frequencies by 0.5 before squaring. Additionally, it is important to remember that the chi-square test can only be applied to numbers of progeny, not to proportions or percentages.
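The formula and the Yates correction described above can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the function name `chi_square` and its `yates` flag are hypothetical, the example counts are made up purely to show the effect of the correction, and clamping the corrected difference at zero is a defensive choice added here rather than something stated in the text.

```python
def chi_square(observed, expected, yates=False):
    """Sum (|O - E| - c)^2 / E over categories; c is 0.5 with Yates' correction, else 0."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of categories")
    correction = 0.5 if yates else 0.0
    total = 0.0
    for o, e in zip(observed, expected):
        # Clamp at zero so the correction never turns a tiny deviation negative.
        diff = max(abs(o - e) - correction, 0.0)
        total += diff ** 2 / e
    return total


# Hypothetical small sample: 16 progeny with a 3:1 expectation (counts, not percentages).
small_observed = [14, 2]
small_expected = [12, 4]

uncorrected = chi_square(small_observed, small_expected)
corrected = chi_square(small_observed, small_expected, yates=True)
print(f"uncorrected = {uncorrected:.3f}, Yates-corrected = {corrected:.3f}")
```

Note that the inputs are raw counts of progeny; passing proportions or percentages would give a statistic that cannot be compared against a chi-square table.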
Now that you know the rules for using the test, it's time to consider an example of how to calculate Pearson's chi-square. Recall that when Mendel crossed his pea plants, he learned that tall (T) was dominant to short (t). You want to confirm that this is correct, so you start by formulating the following null hypothesis: In a cross between two heterozygote (Tt) plants, the offspring should occur in a 3:1 ratio of tall plants to short plants. Next, you cross the plants, and after the cross, you measure the characteristics of 400 offspring. You note that there are 305 tall pea plants and 95 short pea plants; these are your observed values. Meanwhile, you expect that there will be 300 tall plants and 100 short plants from the Mendelian ratio.
You are now ready to perform statistical analysis of your results, but first, you have to choose a critical value at which to reject your null hypothesis. You opt for a critical value probability of 0.01 (1%) that the deviation between the observed and expected values is due to chance. This means that if the probability is less than 0.01, then the deviation is significant and not due to chance, and you will reject your null hypothesis. However, if the probability is greater than 0.01, then the deviation is not significant and you will not reject the null hypothesis.
So, should you reject your null hypothesis or not? Here's a summary of your observed and expected data:
 | Tall | Short |
Expected | 300 | 100 |
Observed | 305 | 95 |
Now, let's calculate Pearson's chi-square: χ^{2} = (305 − 300)^{2}/300 + (95 − 100)^{2}/100 = 25/300 + 25/100 ≈ 0.08 + 0.25 = 0.33.
Next, you determine the probability that is associated with your calculated chi-square value. To do this, you compare your calculated chi-square value with theoretical values in a chi-square table that has the same number of degrees of freedom. Degrees of freedom represent the number of ways in which the observed outcome categories are free to vary. For Pearson's chi-square test, the degrees of freedom are equal to n - 1, where n represents the number of different expected phenotypes (Pierce, 2005). In your experiment, there are two expected outcome phenotypes (tall and short), so n = 2 categories, and the degrees of freedom equal 2 - 1 = 1. Thus, with your calculated chi-square value (0.33) and the associated degrees of freedom (1), you can determine the probability by using a chi-square table (Table 1).
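As a cross-check on the hand calculation and the table lookup, the example can be reproduced in Python using only the standard library. This is a sketch, not part of the original article: the variable names are illustrative, and the exact probability relies on a known identity that holds only for 1 degree of freedom, where the upper-tail chi-square probability equals erfc(√(χ^{2}/2)).

```python
import math

observed = [305, 95]   # tall, short (counts from the example)
expected = [300, 100]  # 3:1 ratio applied to 400 offspring

# Pearson's chi-square: sum of (O - E)^2 / E over the outcome categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of expected phenotype categories minus one.
df = len(observed) - 1

# Valid for df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.2f}")
```

The computed probability falls between the 0.10 and 0.90 columns of the table for 1 degree of freedom, which is the range a table lookup can resolve. For more than 1 degree of freedom, a statistics library would be needed to compute the exact probability.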
Table 1: Chi-Square Table
Degrees of Freedom (df) | Probability (P) |
df | 0.995 | 0.99 | 0.975 | 0.95 | 0.90 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 |
1 | --- | --- | 0.001 | 0.004 | 0.016 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879 |
2 | 0.010 | 0.020 | 0.051 | 0.103 | 0.211 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597 |
3 | 0.072 | 0.115 | 0.216 | 0.352 | 0.584 | 6.251 | 7.815 | 9.348 | 11.345 | 12.838 |
4 | 0.207 | 0.297 | 0.484 | 0.711 | 1.064 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860 |
5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 9.236 | 11.070 | 12.833 | 15.086 | 16.750 |
6 | 0.676 | 0.872 | 1.237 | 1.635 | 2.204 | 10.645 | 12.592 | 14.449 | 16.812 | 18.548 |
7 | 0.989 | 1.239 | 1.690 | 2.167 | 2.833 | 12.017 | 14.067 | 16.013 | 18.475 | 20.278 |
8 | 1.344 | 1.646 | 2.180 | 2.733 | 3.490 | 13.362 | 15.507 | 17.535 | 20.090 | 21.955 |
9 | 1.735 | 2.088 | 2.700 | 3.325 | 4.168 | 14.684 | 16.919 | 19.023 | 21.666 | 23.589 |
10 | 2.156 | 2.558 | 3.247 | 3.940 | 4.865 | 15.987 | 18.307 | 20.483 | 23.209 | 25.188 |
11 | 2.603 | 3.053 | 3.816 | 4.575 | 5.578 | 17.275 | 19.675 | 21.920 | 24.725 | 26.757 |
12 | 3.074 | 3.571 | 4.404 | 5.226 | 6.304 | 18.549 | 21.026 | 23.337 | 26.217 | 28.300 |
13 | 3.565 | 4.107 | 5.009 | 5.892 | 7.042 | 19.812 | 22.362 | 24.736 | 27.688 | 29.819 |
14 | 4.075 | 4.660 | 5.629 | 6.571 | 7.790 | 21.064 | 23.685 | 26.119 | 29.141 | 31.319 |
15 | 4.601 | 5.229 | 6.262 | 7.261 | 8.547 | 22.307 | 24.996 | 27.488 | 30.578 | 32.801 |
16 | 5.142 | 5.812 | 6.908 | 7.962 | 9.312 | 23.542 | 26.296 | 28.845 | 32.000 | 34.267 |
17 | 5.697 | 6.408 | 7.564 | 8.672 | 10.085 | 24.769 | 27.587 | 30.191 | 33.409 | 35.718 |
18 | 6.265 | 7.015 | 8.231 | 9.390 | 10.865 | 25.989 | 28.869 | 31.526 | 34.805 | 37.156 |
19 | 6.844 | 7.633 | 8.907 | 10.117 | 11.651 | 27.204 | 30.144 | 32.852 | 36.191 | 38.582 |
20 | 7.434 | 8.260 | 9.591 | 10.851 | 12.443 | 28.412 | 31.410 | 34.170 | 37.566 | 39.997 |
21 | 8.034 | 8.897 | 10.283 | 11.591 | 13.240 | 29.615 | 32.671 | 35.479 | 38.932 | 41.401 |
22 | 8.643 | 9.542 | 10.982 | 12.338 | 14.041 | 30.813 | 33.924 | 36.781 | 40.289 | 42.796 |
23 | 9.260 | 10.196 | 11.689 | 13.091 | 14.848 | 32.007 | 35.172 | 38.076 | 41.638 | 44.181 |
24 | 9.886 | 10.856 | 12.401 | 13.848 | 15.659 | 33.196 | 36.415 | 39.364 | 42.980 | 45.559 |
25 | 10.520 | 11.524 | 13.120 | 14.611 | 16.473 | 34.382 | 37.652 | 40.646 | 44.314 | 46.928 |
26 | 11.160 | 12.198 | 13.844 | 15.379 | 17.292 | 35.563 | 38.885 | 41.923 | 45.642 | 48.290 |
27 | 11.808 | 12.879 | 14.573 | 16.151 | 18.114 | 36.741 | 40.113 | 43.195 | 46.963 | 49.645 |
28 | 12.461 | 13.565 | 15.308 | 16.928 | 18.939 | 37.916 | 41.337 | 44.461 | 48.278 | 50.993 |
29 | 13.121 | 14.256 | 16.047 | 17.708 | 19.768 | 39.087 | 42.557 | 45.722 | 49.588 | 52.336 |
30 | 13.787 | 14.953 | 16.791 | 18.493 | 20.599 | 40.256 | 43.773 | 46.979 | 50.892 | 53.672 |
40 | 20.707 | 22.164 | 24.433 | 26.509 | 29.051 | 51.805 | 55.758 | 59.342 | 63.691 | 66.766 |
50 | 27.991 | 29.707 | 32.357 | 34.764 | 37.689 | 63.167 | 67.505 | 71.420 | 76.154 | 79.490 |
60 | 35.534 | 37.485 | 40.482 | 43.188 | 46.459 | 74.397 | 79.082 | 83.298 | 88.379 | 91.952 |
70 | 43.275 | 45.442 | 48.758 | 51.739 | 55.329 | 85.527 | 90.531 | 95.023 | 100.425 | 104.215 |
80 | 51.172 | 53.540 | 57.153 | 60.391 | 64.278 | 96.578 | 101.879 | 106.629 | 112.329 | 116.321 |
90 | 59.196 | 61.754 | 65.647 | 69.126 | 73.291 | 107.565 | 113.145 | 118.136 | 124.116 | 128.299 |
100 | 67.328 | 70.065 | 74.222 | 77.929 | 82.358 | 118.498 | 124.342 | 129.561 | 135.807 | 140.169 |
Not Significant & Do Not Reject Hypothesis |
Significant & Reject Hypothesis |
(Table adapted from Jones, 2008)
Note that the chi-square table is organized with degrees of freedom (df) in the left column and probabilities (P) at the top. The chi-square values associated with the probabilities are in the center of the table. To determine the probability, first locate the row for the degrees of freedom for your experiment, then determine where the calculated chi-square value would be placed among the theoretical values in the corresponding row.
At the beginning of your experiment, you decided that if the probability was less than 0.01, you would reject your null hypothesis because the deviation would be significant and not due to chance. Now, looking at the row that corresponds to 1 degree of freedom, you see that your calculated chi-square value of 0.33 falls between 0.016, which is associated with a probability of 0.90, and 2.706, which is associated with a probability of 0.10. Therefore, there is between a 10% and 90% probability that the deviation you observed between the expected and the observed numbers of tall and short plants is due to chance. In other words, the probability associated with your chi-square value is much greater than the critical value of 0.01. This means that you will not reject your null hypothesis, and the deviation between the observed and expected results is not significant.
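The table lookup just described can also be written out as a short Python sketch. The function names `bracket_probability` and `reject_null` are hypothetical; the critical values are copied from the row for 1 degree of freedom in Table 1, leaving out the two leftmost columns, which are blank in that row.

```python
# Row for df = 1 from the chi-square table, as (probability, critical value) pairs.
# The entries for P = 0.995 and P = 0.99 are blank in the table and are omitted.
DF1_ROW = [
    (0.975, 0.001), (0.95, 0.004), (0.90, 0.016), (0.10, 2.706),
    (0.05, 3.841), (0.025, 5.024), (0.01, 6.635), (0.005, 7.879),
]


def bracket_probability(chi2, row):
    """Return (upper_p, lower_p) bracketing chi2 in the row; None marks an open end."""
    upper_p = None
    for p, critical in row:
        if chi2 < critical:
            return upper_p, p
        upper_p = p
    return upper_p, None


def reject_null(chi2, row, alpha):
    """Reject when chi2 reaches the critical value listed under probability alpha."""
    critical = dict(row)[alpha]
    return chi2 >= critical


upper_p, lower_p = bracket_probability(0.33, DF1_ROW)
print(f"P lies between {lower_p} and {upper_p}")
print("Reject null hypothesis:", reject_null(0.33, DF1_ROW, alpha=0.01))
```

With a chi-square value of 0.33, the bracket is the 0.90 and 0.10 columns, and the value is far below the 6.635 needed to reject at the 0.01 level.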
AMAP's regular work-in-progress seminar series is held on the first Tuesday or Wednesday of the month, and presenters bring their current work in various stages of completion. Presenters may ask the group for advice on a problem they are stuck on, present a new method they are working out, or come seeking practice talking about their work from a methodological angle. Presentations do not have to be on methods projects; in fact, we encourage presentations of substantive research projects. However, we do encourage presenters to discuss their methods in depth during their presentations. The atmosphere is friendly and supportive. Usually, we schedule faculty members for the entire brown bag time. We usually schedule two graduate students per brown bag, with each graduate student receiving 30 minutes both to present and to receive feedback (in general, we encourage graduate students to present for about 15 minutes to leave about 15 minutes for Q&A). If you are interested in presenting at a work-in-progress session, please contact AMAP@purdue.edu.
AMAP hosts a variety of one- and two-day workshops on a wide range of advanced methodological and statistical topics. The workshops are designed to provide participants (who can include faculty, graduate students, and undergraduate students) with supplemental training on advanced subjects in quantitative and qualitative methods. The workshops are designed around methods that can be adequately covered in a single day and/or methods that are rarely covered in most graduate courses but are useful for applied researchers.
AMAP also hosts various other events such as invited lectures, symposiums, and receptions. These are great meet-and-greets that allow for networking both within and outside of Purdue.
Go to the workshop resources page for recorded AMAP workshops and slides.
*Email AMAP@purdue.edu if you are interested in presenting your work in progress at our work-in-progress series or if you are interested in offering a workshop.
Rani has over 25 years of experience as a clinician and statistical research consultant in the public, private, and government sectors. Rani has been a consultant at Boston College since 1998 and has customized discipline-specific and general statistics courses for faculty and graduate students on a variety of statistical subjects, including introductions to SPSS, Stata, and SAS, Regression, Survival Analysis, HLM, AMOS, and, in conjunction with the O'Neill Library Staff, "Access to Dataset Repositories for Social Science Research."
Rani's current research interests include Gerontology, Measures of Psychological Resiliency in Adolescents and Adults, Quantifying Success Predictors in Hospital Based Social Work Practice, and Quantifying Success Predictors for Homeschooling and Distance Learning in K through 12 students.
"Case Management as Management" with Dr. Nancy Veeder, Journal of Social Service Research, St. Louis, MO, January 2005.
'The concepts and methods of Statistical Physics play a key role, not always fully perceived, in all branches of Physics. With this textbook, aimed primarily at advanced undergraduates but useful also for experienced researchers, Heissenberg and Sagnotti explain clearly and convincingly why it is so. Besides presenting a modern exposition of the basic facts of Statistical Physics well equipped with problems, a stimulating and broad range of advanced subjects is introduced, whetting the appetite of the determined reader and pushing them to go farther to Quantum Field Theory and Mathematical Physics.'
Prof. Roberto Raimondi - Università Roma Tre
'In its presentation of statistical mechanics, this book is unique for its emphasis on the quantum mechanical underpinnings. It would make a great text for a multi-disciplinary course on many-body physics for upper-division undergraduates or beginning graduate students. Even in the more-elementary first half, the book is full of underappreciated gems, and gives glimpses of a broad view of Theoretical Physics as a whole. The second half boasts a uniform and elementary treatment of the Onsager solution, the Bethe ansatz, the Renormalization Group, and the approach to equilibrium.'
Prof. John McGreevy - University of California, San Diego