Statistics: Variables & Inference Summary & Study Notes
These study notes provide a concise summary of Statistics: Variables & Inference, covering key concepts, definitions, and examples to help you review quickly and study effectively.
🔬 Variable Fundamentals
In data analysis, you often investigate causal relationships between variables using experiments or observations. The variable you measure is the response variable, which responds to the explanatory variable(s) that you manipulate or observe. There is typically only one response variable per analysis. In plots, the response variable goes on the Y-axis and the explanatory variable(s) on the X-axis.
🧭 Explanatory vs Response Variables
- Does the weather affect the transmission of flu? Explanatory: Temperature, humidity, wind speed; Response: Flu reproduction rates.
- Does wildfire smoke affect deer behavior? Explanatory: Air quality; Response: Deer daily activity.
🗃️ Types of Variables
Categorical vs Numerical
- A categorical variable can take on a limited number of values, usually non-numeric, and observations are assigned to groups based on qualitative properties. Examples include gender, colour, species, satisfaction. There are two subtypes that require different handling in analyses: Nominal and Ordinal.
- A numerical variable is quantifiable and has numeric values. Examples include height, age, speed, population density. There are three common numerical subtypes to distinguish: Continuous, Discrete, and Binomial.
Nominal vs Ordinal
- Nominal: Describes categories without natural order. Examples: favourite colour, species.
- Ordinal: Describes categories with a natural order. Examples: shirt sizes (S, M, L), satisfaction scores.
Numerical Subtypes
- Continuous: Can take any real value within an interval (infinitely many possible values). Examples: temperature, height. Most numerical analyses apply here, given appropriate assumptions.
- Discrete: Can take only a countable set of values, typically integers or fixed increments. Examples: number of students, gymnastics scores.
- Binomial: A specific discrete type that represents two outcomes (0/1). Often treated as a proportion in analyses.
Scale types and practical notes
- Interval-scale variables have arbitrary zeros. Examples include temperature and latitude. They can often be analyzed with linear models after careful handling.
- Ratio-scale variables have meaningful zeros (e.g., precipitation). They support a full set of arithmetic operations and many standard analyses.
Practical tip: Start by identifying whether a variable is categorical or numerical, then determine if it’s nominal, ordinal, continuous, or discrete to choose appropriate plots and tests.
📈 Plot Types & Statistical Approaches
- For categorical variables, use plot types like boxplots, violin plots, strip charts, bar graphs, frequency tables, contingency tables, and count plots. Common analyses involving categorical variables include chi-square tests on counts, and 2-sample t-tests or ANOVA when comparing a numerical response across categorical groups.
- An ordinal variable can be handled either with the corresponding categorical methods or, after converting its categories to a numerical scale, with numerical plots and methods.
- For numerical variables, use scatterplots, histograms, area plots, density plots, maps, and line graphs. You can apply the full set of statistical methods appropriate to numerical data, with specific transformations if needed.
- An ordinal variable can sometimes be treated as numerical by assigning scores (e.g., S = -1, M = 0, L = 1) to enable certain analyses, though interpretation should be cautious.
🧮 Converting Categorical to Numerical Scales (Ordinal Example)
- A common approach is to map ordered categories to numerical values to enable numerical methods. For example, convert shirt sizes S, M, L to -1, 0, 1, respectively. This enables linear or regression-like analyses while acknowledging the ordinal nature.
- Always interpret results in light of the original ordering and measurement level. In many cases, non-parametric or rank-based methods may be more appropriate for ordinal data.
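The ordinal conversion above can be sketched in a few lines of Python (the size data and the -1/0/1 scores follow the shirt-size example in the text; the sample values themselves are made up for illustration):

```python
# Illustrative sketch: mapping ordered shirt sizes to numeric scores.
# The sample data and the choice of scores (-1, 0, 1) are assumptions.
sizes = ["S", "M", "L", "M", "S", "L", "L"]
score_map = {"S": -1, "M": 0, "L": 1}

scores = [score_map[s] for s in sizes]
print(scores)  # [-1, 0, 1, 0, -1, 1, 1]

# Numeric scores now allow a mean "size level", but remember that
# equal spacing between categories is itself an assumption.
mean_score = sum(scores) / len(scores)
print(round(mean_score, 2))  # 0.14
```

Note that the equal spacing implied by -1, 0, 1 is exactly the kind of assumption that makes rank-based methods safer for ordinal data.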
🧪 ANOVA Essentials
ANOVA compares the variation among group means to the variation within groups. It is essentially a generalisation of the t-test for more than two groups, with the test statistic defined as the ratio of mean squares between groups to mean squares within groups: F = MS_between / MS_within.
- If F > 1, there is more variation between groups than within groups; if F < 1, more variation exists within groups. There is no universal cut-off for F like 1.96 for z; thresholds depend on the number of groups and sample size. Decisions rely on the p-value from software output.
- To control false positives when testing many groups, use multiple comparison adjustments such as Tukey's Honestly Significant Differences (HSD) test. Tukey’s test performs pairwise comparisons with a correction that accounts for multiple comparisons.
- Adjustment of confidence intervals in ANOVA can be done using an adjusted confidence level of 1 - α/k, where k is the number of comparisons and α is the desired overall error rate. For example, with k = 10 and α = 0.05, the adjusted CI corresponds to a 1 - 0.05/10 = 0.995 confidence level.
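The F-ratio logic above can be sketched in Python (the notes use R below; this is an equivalent one-way ANOVA via SciPy). The three groups are made-up data for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (data are made up for illustration)
rng = np.random.default_rng(42)
group_a = rng.normal(10.0, 1.0, 30)
group_b = rng.normal(10.5, 1.0, 30)
group_c = rng.normal(12.0, 1.0, 30)

# One-way ANOVA: F is the ratio of between-group to within-group mean squares
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

Because the simulated group means differ by more than a within-group standard deviation, F comes out well above 1 and the p-value is small; with near-identical group means, F would hover around 1.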
How to do an ANOVA in R (conceptual steps)
- Load data and fit a model: modelName <- aov(responseColumn ~ explanatoryColumn, data = tableName)
- View the ANOVA table (sums of squares, mean squares, F, and p-value): summary(modelName)
- Get adjusted pairwise differences: modelResults <- TukeyHSD(modelName), then view with print(modelResults) or plot(modelResults)
- Note: TukeyHSD() requires a model fitted with aov(); if you fit with lm() instead, anova(modelName) returns the sums of squares and mean squares.
Interpreting ANOVA outputs
- The tables show the mean differences between groups and the confidence intervals adjusted for multiple comparisons. If an adjusted confidence interval excludes zero, the corresponding pair of groups differs significantly.
🔎 Confidence Intervals & Statistical Principles
The goal of statistical inference is to estimate population parameters and quantify our certainty about them. For a population mean, we use the sample mean and its sampling distribution to form a confidence interval (CI).
- A 95% CI means that, if you repeated the sampling many times, about 95% of intervals constructed this way would contain the true population mean. In practice, there is a 5% chance a given interval misses the true mean due to sampling error.
- Key principles for normally distributed sample means:
- About 68% of sample means fall within 1 standard error (SE) of the population mean.
- About 95% of sample means fall within 1.96 SEs of the population mean.
- The confidence interval is the range of values that contains the specified proportion of the population of sample means.
- The standard error is SE = s / √n, where s is the sample standard deviation and n is the sample size.
How to calculate a CI
- Calculate the sample mean x̄ and standard deviation s.
- Compute the standard error SE = s / √n.
- Determine the appropriate conversion factor for your CI level (e.g., 1.96 for a 95% CI with large n).
- Compute: Upper = x̄ + factor × SE, Lower = x̄ - factor × SE.
- For small samples (or non-normal data), use the t-distribution: the factor is the critical t value with n - 1 degrees of freedom, obtainable via qt(...) in R (e.g., qt(0.975, df = n - 1) for a 95% CI).
Example
You weigh 100 deer mice and find a mean weight of 11.3 g with standard deviation 1.2 g. Then SE = 1.2 / √100 = 0.12 g. Using a 95% CI with n = 100, the factor is 1.96. So the 95% CI is 11.3 ± 1.96 × 0.12, i.e. from approximately 11.06 g to 11.54 g. You can say with 95% confidence that the true mean lies in this interval.
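The deer-mouse calculation can be checked directly in Python, including the small-sample t alternative mentioned above:

```python
import math
from scipy import stats

# Deer mouse example from the text: n = 100, mean = 11.3 g, sd = 1.2 g
n, mean, sd = 100, 11.3, 1.2

se = sd / math.sqrt(n)                   # standard error = 0.12 g
z = 1.96                                 # large-sample 95% factor
lower, upper = mean - z * se, mean + z * se
print(round(lower, 2), round(upper, 2))  # 11.06 11.54

# For small samples, swap the z factor for a t critical value
# (equivalent to qt(0.975, df = n - 1) in R):
t = stats.t.ppf(0.975, df=n - 1)
print(round(t, 3))
```

With n = 100 the t factor (about 1.98) barely differs from 1.96, which is why the large-sample approximation is acceptable here.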
🧠 Shapiro-Wilk Normality Test
Normality is a key assumption for many tests. The Shapiro-Wilk test yields a W statistic and a p-value. A small p-value (p ≤ 0.05) suggests the data deviate from normality, while p > 0.05 indicates the data can be treated as normally distributed for practical purposes.
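A quick Python sketch of the test, contrasting a sample drawn from a normal distribution with a right-skewed (exponential) one; both samples are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(0, 1, 200)       # drawn from a normal distribution
skewed_sample = rng.exponential(1.0, 200)   # right-skewed, clearly non-normal

w1, p1 = stats.shapiro(normal_sample)
w2, p2 = stats.shapiro(skewed_sample)
print(f"normal sample: W = {w1:.3f}, p = {p1:.3f}")
print(f"skewed sample: W = {w2:.3f}, p = {p2:.3g}")
```

For the skewed sample, W drops noticeably below 1 and the p-value is tiny, so normality is rejected; the normal sample typically produces a W close to 1 and a large p-value.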
🔗 Correlations
Correlations measure the degree of association between two numerical variables and answer whether, as one variable changes, the other tends to change in the same direction.
- The correlation coefficient ranges from -1 to +1. A value near 0 indicates little to no linear association, while values near -1 or +1 indicate strong linear association. A positive coefficient means variables move in the same direction; a negative one means they move in opposite directions.
- Two main types:
- Pearson’s correlation coefficient is used when both variables are normally distributed.
- Spearman’s correlation coefficient is used when at least one variable is not normally distributed; it ranks the data before computing the correlation.
- Calculation in R: cor(x, y, method = "pearson") or cor(x, y, method = "spearman"), depending on normality.
- Important caveat: correlation does not convey the magnitude of change (slope) or causation; it only reflects association.
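Alongside R's cor(), the same two coefficients can be computed in Python with SciPy; the paired measurements here are made up for illustration (y roughly doubles x):

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

pearson_r, _ = stats.pearsonr(x, y)    # assumes both variables ~ normal
spearman_r, _ = stats.spearmanr(x, y)  # rank-based, no normality assumed
print(round(pearson_r, 3), round(spearman_r, 3))
```

Because y increases monotonically with x, Spearman's rank correlation is 1; Pearson's is also near 1 since the relationship is close to linear. Neither value tells you the slope, only the strength of association.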
📈 Distributions Overview
Understanding the distribution of each variable helps determine appropriate analyses. There are six common distribution types:
- Normal: Peak at the center, symmetric, unimodal.
- Right Skewed: Peak at the low end with a tail to the right, unimodal.
- Left Skewed: Peak at the high end with a tail to the left, unimodal.
- Bimodal: Two peaks, suggesting structure or subgroups.
- Uniform: Data spread evenly with no peak.
- Multimodal: More than two peaks, indicating complex structure.
Practical Notes
- The distribution informs the choice of tests and transformations. Normal distributions support many parametric tests, while skewed or multimodal data may require non-parametric methods or data transformation.
- Always plot and summarize distributions before formal testing to detect outliers and deviations from assumptions.
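As a quick numerical companion to plotting, a skewness statistic summarizes asymmetry: roughly 0 for symmetric data and clearly positive for a right tail. This sketch uses simulated data for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
symmetric = rng.normal(0, 1, 1000)          # should look normal / symmetric
right_skewed = rng.exponential(1.0, 1000)   # peak at the low end, right tail

skew_sym = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)
print(round(skew_sym, 2), round(skew_right, 2))
```

A skewness near 0 supports parametric methods; a large positive (or negative) value signals that a transformation or a non-parametric method may be more appropriate.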