Comprehensive Statistics Notes Summary & Study Notes
These study notes provide a concise summary of Comprehensive Statistics Notes, covering key concepts, definitions, and examples to help you review quickly and study effectively.
Graph Types and When to Use
Graphs visualize relationships and distributions to reveal patterns in data. The choice of plot depends on the types of variables you are analyzing and the questions you want to answer. Bar plots, histograms, box plots, scatter plots, and line plots each highlight different aspects of data.
- Bar Plot: Shows the frequency of each category in a dataset. Use it to assess the distribution of a categorical variable. Bars are separated by spaces to reflect distinct categories.
- Histogram: Shows the frequency of values for a numerical variable. Use it to assess the distribution of numerical data; bars touch to reflect continuity. Bin width can be adjusted to reveal different levels of detail.
- Grouped Bar Plot: Extends the bar plot to two categorical variables to explore how one is associated with the other. Use colour to distinguish the categories of the second variable.
- Box Plot: Summarizes the distribution of a numerical variable across different categories. It highlights the central value and spread within each group and can reveal outliers.
- Scatterplot: Displays the relationship between two numerical variables. Use it to assess how one numerical variable changes with another and to visualize potential patterns with a trendline.
- Time (Line) Plot: Plots a numerical variable against time. Useful for identifying trends over time, with a single observation per time point connected by lines.
- Spatial Plot: Examines a variable across space. Use colour or size to denote numerical or categorical values at locations, as in maps or heatmaps.
- What Plot Should I Use? Identify the type of each variable first (categorical, numerical, time, or spatial); each combination of variable types has a best-suited plot, as described in the list above. The aim is to visualize the data in a way that makes the underlying pattern clear.
Quick code reminders (ggplot2 syntax)
- Box plot: geom_boxplot(aes(x, y))
- Grouped bar plot: geom_bar(aes(x, fill = y), position = "dodge")
- Simple bar plot: geom_bar(aes(x))
- Scatter with trendline: geom_point(aes(x, y)) + geom_smooth(aes(x, y), method = "lm")
- Histogram: geom_histogram(aes(x))
- Time: geom_line(aes(time, y))
- Spatial map: map or geom_sf()
- Heat map: geom_tile(aes(lon, lat, fill = x))
There are many plot types available, each designed for specific data types and questions. Use the variable types to guide your choice and select the plot that best communicates the result.
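The decision rule described in this section can be sketched as a small lookup in base R. The function name `choose_geom` and its type labels are hypothetical, introduced only for illustration; the mapping itself follows the plot descriptions in these notes.

```r
# Hypothetical helper: map variable types to the ggplot2 geom suggested in these notes.
# `x_type` and `y_type` take "categorical", "numerical", or "time";
# leave `y_type` as NA when only one variable is plotted.
choose_geom <- function(x_type, y_type = NA) {
  if (is.na(y_type)) {
    # One variable: bar plot for categories, histogram for numbers
    return(switch(x_type,
                  categorical = "geom_bar",
                  numerical   = "geom_histogram",
                  stop("unknown variable type: ", x_type)))
  }
  key <- paste(x_type, y_type, sep = "-")
  switch(key,
         "categorical-categorical" = "geom_bar (grouped)",
         "categorical-numerical"   = "geom_boxplot",
         "numerical-numerical"     = "geom_point",
         "time-numerical"          = "geom_line",
         stop("no suggestion for: ", key))
}

print(choose_geom("numerical"))                 # single numerical variable
print(choose_geom("categorical", "numerical"))  # numerical response across groups
```

A lookup like this is only a starting point; the final choice still depends on the question being asked of the data.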
Interpreting and Reporting Results
When reporting results from statistical analyses, clarity matters. A results section should convey the direction, magnitude, precision, and statistical support of effects to create a precise understanding of the findings.
- Direction: State whether the effect increases or decreases (e.g., a positive slope or a higher mean in one group).
- Magnitude: Provide the estimated size of the effect (e.g., difference in means, odds ratio, or regression slope).
- Precision: Report a Confidence Interval (CI) around the estimate to show uncertainty.
- Statistical Support: Include the test statistic (t, F, z), degrees of freedom, and the p-value in parentheses.
Example for a t-test in words: "Infected hosts lived, on average, half as long as uninfected hosts (20 days vs. 40 days; CI for the difference, Figure 2)." The sentence should state the direction and magnitude first, followed by the precision and significance values.
- A typical t-test statement includes: the mean difference, the CI, the t statistic, the degrees of freedom, and the p-value. Match the sign of the test statistic with the described direction (e.g., a positive t corresponds to an increase or larger value).
- The CI expresses the estimated difference and its uncertainty: e.g., CI: 18–29 g indicates the true difference could plausibly lie within that range.
- In ANOVA, report the overall F statistic, degrees of freedom, and p-value to indicate that at least one group differs, followed by post hoc comparisons if needed (e.g., Tukey tests).
- For regression, report the slope and its test statistic, the CI for the slope, and the R-squared value. The intercept is reported, along with its test statistic, if it has a meaningful interpretation. The basic regression model is $y = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$ is the slope. The slope describes the rate of change of the response with the explanatory variable. The line of best fit minimizes the sum of squared residuals (the vertical distances from the observed points to the line).
- In practice, use the R output from summary(model) to extract: the slope estimate and its SE, the 95% CI computed as approximately $b_1 \pm 1.96 \times SE_{b_1}$, and the overall model statistics. The 95% CI helps determine whether the slope differs from zero.
Reporting formats to remember
- Always provide the context, group names, and the exact variable labels used in the analysis. Correctly label units for response and explanatory variables when reporting slope.
- If the p-value is extremely small, you can report as "p < 0.001" instead of a lengthy number.
- Present the results in a way that a reader can reproduce the reasoning and check the conclusions against the data.
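As a rough sketch of how the four reporting elements (direction, magnitude, precision, statistical support) can be pulled from R output, the example below runs a two-sample t-test and assembles a report sentence. The lifespan values are invented for illustration and simply mirror the 20 vs. 40 day example above; they are not real data.

```r
# Hypothetical lifespans (days); values invented for illustration only
uninfected <- c(38, 42, 41, 39, 40, 43, 37, 40)
infected   <- c(19, 22, 21, 18, 20, 23, 17, 20)

res <- t.test(uninfected, infected)  # Welch two-sample t-test (R's default)

diff_means <- unname(res$estimate[1] - res$estimate[2])  # direction and magnitude
ci         <- res$conf.int                                # 95% CI for the difference
# Very small p-values are reported as "p < 0.001"
p_text <- if (res$p.value < 0.001) "p < 0.001" else sprintf("p = %.3f", res$p.value)

report <- sprintf(
  "Uninfected hosts lived %.1f days longer (95%% CI: %.1f to %.1f; t = %.2f, df = %.1f, %s)",
  diff_means, ci[1], ci[2], res$statistic, res$parameter, p_text
)
print(report)
```

The sign of the t statistic is positive here because the uninfected group is listed first and has the larger mean, which matches the direction stated in the sentence.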
Regressions: Part 1
A regression describes the slope of the relationship between two numerical variables. The key test is whether the slope differs from zero. If the slope is not significantly different from zero, there is no evidence of a linear relationship.
- The basic form is $y = b_0 + b_1 x$, where $y$ is the response, $x$ is the explanatory variable, $b_0$ is the intercept, and $b_1$ is the slope.
- The regression relies on three core assumptions: (1) random sampling, (2) a linear relationship between $x$ and $y$, and (3) residuals that are normally distributed with constant variance (homoscedasticity).
- The method of least squares finds the line that minimizes the sum of squared residuals.
- A residual is the vertical distance between an observed value and the value predicted by the line, measured in the units of the response variable.
- Running a regression in R typically uses lm() with the format:
modelName <- lm(response ~ explanatory, data = table)
summary(modelName)
- In the output, focus on: the slope estimate and its standard error, the 95% CI for the slope, and $R^2$ (the proportion of variance explained). The F-statistic tests the overall model, but interpretation often centers on the slope and its CI.
- If you report an intercept, include its magnitude, SE, and test statistic. In many studies, the intercept is not of primary interest unless you need the predicted value when the explanatory variable is zero.
- The formula for the paired t-test differs from that of the two-sample t-test; the paired test is useful when observations are naturally paired.
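A minimal worked example of the workflow above, using R's built-in `cars` dataset (stopping distance against speed) as a stand-in for your own data. It extracts the slope, its standard error, a 95% CI, and the proportion of variance explained, and checks the least-squares property numerically. The helper `ssr` is a hypothetical name introduced here for illustration.

```r
# Built-in `cars` data: stopping distance (ft) against speed (mph)
model <- lm(dist ~ speed, data = cars)

coefs <- summary(model)$coefficients      # estimates, SEs, t values, p-values
slope <- coefs["speed", "Estimate"]
se    <- coefs["speed", "Std. Error"]

ci_t      <- confint(model, "speed", level = 0.95)  # CI based on the t distribution
ci_approx <- slope + c(-1.96, 1.96) * se            # normal approximation
r2        <- summary(model)$r.squared               # proportion of variance explained

# Least squares: nudging the fitted slope can only increase the sum of squared residuals
ssr <- function(b0, b1) sum((cars$dist - (b0 + b1 * cars$speed))^2)
ssr_fit   <- ssr(coef(model)[1], coef(model)[2])
ssr_other <- ssr(coef(model)[1], coef(model)[2] + 0.1)

print(round(c(slope = slope, se = se, r2 = r2), 3))
print(ci_t)
print(c(fitted = ssr_fit, nudged = ssr_other))
```

Because the whole CI lies above zero, the slope differs from zero at the 0.05 level. The interval from `confint()` is slightly wider than the 1.96 approximation because it uses the t distribution with the model's residual degrees of freedom.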
Summarizing Univariate Data
Understanding a single variable involves both its center and its spread. The main measures are:
- Mean: the average value, computed as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Median: the middle value when data are ordered; it is less affected by outliers.
- Mode: the most frequent value in the dataset.
- Range: the difference between the maximum and minimum values, $x_{\max} - x_{\min}$.
- Variance: the average squared deviation from the mean; for a sample, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
- Standard deviation: the square root of the variance, $s = \sqrt{s^2}$.
- Interquartile Range (IQR): the spread of the middle 50% of data, $Q_3 - Q_1$.
- The choice of metric depends on the distribution. For data that are normally distributed, the mean and standard deviation are informative. For skewed data or data with outliers, the median and IQR give a better summary.
- Always report the corresponding spread with the chosen center (mean with SD, median with IQR) to give a complete sense of dispersion.
- When presenting univariate summaries, include a short interpretation of where the center lies and how variable the data are. This helps readers quickly grasp the data's core characteristics.
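The measures above can be computed in base R as follows. The vector `x` holds invented values chosen so that one large value acts as an outlier; base R has no built-in function for the mode of a dataset, so it is derived here from a frequency table.

```r
x <- c(2, 4, 4, 5, 7, 9, 25)   # invented values; 25 acts as an outlier

centre <- c(mean = mean(x), median = median(x))

# Most frequent value, taken from a frequency table
mode_value <- as.numeric(names(which.max(table(x))))

spread <- c(
  range    = max(x) - min(x),
  variance = var(x),   # sample variance (divides by n - 1)
  sd       = sd(x),    # square root of the variance
  iqr      = IQR(x)    # Q3 - Q1
)

print(centre)
print(mode_value)
print(round(spread, 2))
```

In this sample the outlier pulls the mean (8) above the median (5), and the SD is inflated relative to the IQR, which is why the median and IQR are the safer pair for skewed data.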
T-Tests and Their Variants
A t-test assesses whether an estimate differs from a given value or another estimate, accounting for sample variance. The general t-statistic measures how many standard errors the estimate is from the comparison value.
- One-sample t-test: tests whether the mean of a single numerical variable differs from an expected value $\mu_0$. The test statistic is $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$.
- Two-sample t-test: tests whether the means of two groups differ. The statistic uses the difference between group means and a pooled standard error.
- Paired t-test: tests whether the mean difference between paired observations differs from zero. The statistic is $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $d$ is the difference within pairs, $\bar{d}$ is its mean, $s_d$ is its standard deviation, and $n$ is the number of pairs. The paired test removes variance due to subject-specific differences.
- In practice, report the test statistic, degrees of freedom, and the p-value. If the p-value is smaller than the chosen alpha (e.g., 0.05), report that the difference is statistically significant.
- When reporting, clearly state the direction and magnitude of the difference, followed by the CI for the mean difference where available.
- The following quick checks help interpret results:
  - If the 95% CI for a mean difference includes 0, the difference is not statistically significant at the 0.05 level.
  - A very small p-value indicates strong evidence against the null hypothesis under the assumptions of the test.
- Practical examples: a t-test might compare body temperatures to a standard value (one-sample) or compare temperatures between two environmental conditions (two-sample). For paired designs, keep the pairing in the analysis to reduce extraneous variability.
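To make the one-sample and paired statistics concrete, the sketch below computes each by hand (estimate divided by its standard error) and compares it with the value from `t.test()`. All measurements are invented for illustration.

```r
# One-sample: do body temperatures (invented values, in °C) differ from 37?
temps <- c(36.8, 37.1, 36.6, 36.9, 37.0, 36.7, 36.5, 36.8)
mu0   <- 37
t_one_manual <- (mean(temps) - mu0) / (sd(temps) / sqrt(length(temps)))
t_one_r      <- unname(t.test(temps, mu = mu0)$statistic)

# Paired: the same subjects measured before and after (invented values)
before <- c(12.1, 11.4, 13.0, 12.6, 11.9, 12.3)
after  <- c(12.9, 11.8, 13.9, 12.7, 12.8, 12.9)
d <- after - before                                   # within-pair differences
t_pair_manual <- mean(d) / (sd(d) / sqrt(length(d)))
t_pair_r      <- unname(t.test(after, before, paired = TRUE)$statistic)

print(c(manual = t_one_manual, t.test = t_one_r))
print(c(manual = t_pair_manual, t.test = t_pair_r))
```

The manual and built-in values agree, which confirms that the paired test is simply a one-sample test on the within-pair differences.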
Quick Reference: Common Symbols and Equations
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Standard deviation: $s = \sqrt{s^2}$
- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- Median: the middle value in ordered data
- IQR: $Q_3 - Q_1$
- Regression line: $y = b_0 + b_1 x$
- Correlation-related statistics: often reported with $p$-values and CIs
- Confidence interval for a slope: $b_1 \pm 1.96 \times SE_{b_1}$ (for 95% CI, assuming normality)
- Coefficient of determination: $R^2$ indicates the proportion of variance explained by the model
- Key interpretation reminders:
  - Match the sign of the test statistic with the direction of the effect.
  - Report the exact degrees of freedom alongside the test statistic.
  - Use multiple comparisons adjustments (e.g., Tukey) when conducting post hoc tests after ANOVA.
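For the ANOVA and post hoc reminder, here is a short sketch using R's built-in `PlantGrowth` dataset (plant weight under a control and two treatments) as a stand-in for your own data. It extracts the overall F statistic, both degrees of freedom, and the p-value, then runs Tukey comparisons with adjusted p-values.

```r
# Built-in `PlantGrowth` data: plant weight under a control and two treatments
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table

f_value <- tab[["F value"]][1]    # overall F statistic
dfs     <- tab[["Df"]]            # between-groups df, then residual df
p_value <- tab[["Pr(>F)"]][1]

tukey <- TukeyHSD(fit)            # pairwise comparisons with adjusted p-values

print(round(c(F = f_value, df1 = dfs[1], df2 = dfs[2], p = p_value), 4))
print(tukey$group)
```

A significant overall F only says that at least one group mean differs; the `p adj` column of the Tukey table shows which pairs differ, and the sign of each `diff` gives the direction.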
Practical Examples to Reinforce Concepts
- Example 1 (Graph types): A researcher wants to compare the distribution of a categorical variable (animal breed) across regions. A grouped bar plot with color by region would be appropriate to illustrate possible associations.
- Example 2 (Regression): A simple linear regression examines how hours of study ($x$) predict exam score ($y$) with the model $y = b_0 + b_1 x$. The slope $b_1$ indicates the expected increase in score per additional hour of study.
- Example 3 (T-test): A one-sample t-test tests whether the average body temperature differs from the standard value of 37°C, reporting $t$, degrees of freedom, and $p$-value to indicate significance.
- Example 4 (Univariate data): A dataset with skewed income values is summarized with the median and IQR alongside the mean and SD, because the median and IQR capture central tendency and spread without distortion by outliers.
Glossary Snippets
- Mean: The arithmetic average of a set of values.
- Median: The middle value in an ordered dataset.
- Mode: The most frequently occurring value in a dataset.
- IQR: The spread of the middle 50% of data.
- R-squared: Proportion of variance explained by a regression model.
- CI: A range of plausible values for the true parameter, constructed so that a given proportion (e.g., 95%) of such intervals from repeated samples would contain it.
- Residual: The difference between an observed value and its fitted value on the model.
Quick Study Prompts
- How do you decide which plot type to use for a given pair of variable types?
- What does a significant F-statistic in ANOVA tell you about group differences?
- How do residuals relate to the goodness of fit in a regression?
- Why might you prefer the median and IQR over the mean and SD for skewed data?