Statistics: Variables & Inference Summary & Study Notes
These study notes provide a concise summary of Statistics: Variables & Inference, covering key concepts, definitions, and examples to help you review quickly and study effectively.
🔬 Variable Fundamentals
In data analysis, you often investigate causal relationships between variables using experiments or observations. The variable you measure is the response variable, which responds to the explanatory variable(s) that you manipulate or observe. There is typically only one response variable per analysis. In plots, the response variable goes on the Y-axis and the explanatory variable(s) on the X-axis.
🧭 Explanatory vs Response Variables
- Does the weather affect the transmission of flu? Explanatory: Temperature, humidity, wind speed; Response: Flu reproduction rates.
- Does wildfire smoke affect deer behavior? Explanatory: Air quality; Response: Deer daily activity.
🗃️ Types of Variables
Categorical vs Numerical
- A categorical variable can take on a limited number of values, usually non-numeric, and observations are assigned to groups based on qualitative properties. Examples include gender, colour, species, satisfaction. There are two subtypes that require different handling in analyses: Nominal and Ordinal.
- A numerical variable is quantifiable and has numeric values. Examples include height, age, speed, population density. There are three common numerical subtypes to distinguish: Continuous, Discrete, and Binomial.
Nominal vs Ordinal
- Nominal: Describes categories without natural order. Examples: favourite colour, species.
- Ordinal: Describes categories with a natural order. Examples: shirt sizes (S, M, L), satisfaction scores.
Numerical Subtypes
- Continuous: Can take any real value within an interval (infinitely many possible values). Examples: temperature, height. Most numerical analyses apply here, given appropriate assumptions.
- Discrete: Can take only a countable set of values, typically integers or fixed increments. Examples: number of students, gymnastics scores.
- Binomial: A specific discrete type that represents two outcomes (0/1). Often treated as a proportion in analyses.
Scale types and practical notes
- Interval-scale variables have arbitrary zeros. Examples include temperature and latitude. They can often be analyzed with linear models after careful handling.
- Ratio-scale variables have meaningful zeros (e.g., precipitation). They support a full set of arithmetic operations and many standard analyses.
Practical tip: Start by identifying whether a variable is categorical or numerical, then determine if it’s nominal, ordinal, continuous, or discrete to choose appropriate plots and tests.
📈 Plot Types & Statistical Approaches
- For categorical variables, use plot types like boxplots, violin plots, strip charts, bar graphs, frequency tables, contingency tables, and count plots. Common analyses involving categorical variables include chi-square tests on counts, and 2-sample t-tests or ANOVA when comparing a numerical response across categorical groups.
- An ordinal variable can be handled either with the corresponding categorical methods or, after converting its categories to a numerical scale, with numerical plots and methods.
- For numerical variables, use scatterplots, histograms, area plots, density plots, maps, and line graphs. You can apply the full set of statistical methods appropriate to numerical data, with specific transformations if needed.
- An ordinal variable can sometimes be treated as numerical by assigning scores (e.g., S = -1, M = 0, L = 1) to enable certain analyses, though interpretation should be cautious.
🧮 Converting Categorical to Numerical Scales (Ordinal Example)
- A common approach is to map ordered categories to numerical values to enable numerical methods. For example, convert shirt sizes S, M, L to -1, 0, 1, respectively. This enables linear or regression-like analyses while acknowledging the ordinal nature.
- Always interpret results in light of the original ordering and measurement level. In many cases, non-parametric or rank-based methods may be more appropriate for ordinal data.
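The ordinal conversion above can be sketched in a few lines of Python (the size data and the -1/0/1 scores follow the shirt-size example in the text; the sample values themselves are made up for illustration):

```python
# Illustrative sketch: mapping ordered shirt sizes to numeric scores.
# The sample data and the choice of scores (-1, 0, 1) are assumptions.
sizes = ["S", "M", "L", "M", "S", "L", "L"]
score_map = {"S": -1, "M": 0, "L": 1}

scores = [score_map[s] for s in sizes]
print(scores)  # [-1, 0, 1, 0, -1, 1, 1]

# Numeric scores now allow a mean "size level", but remember that
# equal spacing between categories is itself an assumption.
mean_score = sum(scores) / len(scores)
print(round(mean_score, 2))  # 0.14
```

Note that the equal spacing implied by -1, 0, 1 is exactly the kind of assumption that makes rank-based methods safer for ordinal data.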
🧪 ANOVA Essentials
ANOVA compares the variation among group means to the variation within groups. It is essentially a generalisation of the t-test for more than two groups, with the test statistic defined as the ratio of mean squares between groups to mean squares within groups: F = MS_between / MS_within.
- If F > 1, there is more variation between groups than within groups; if F < 1, more variation exists within groups. There is no universal cut-off for F like 1.96 for z; thresholds depend on the number of groups and sample size. Decisions rely on the p-value from software output.
- To control false positives when testing many groups, use multiple comparison adjustments such as Tukey's Honestly Significant Differences (HSD) test. Tukey’s test performs pairwise comparisons with a correction that accounts for multiple comparisons.
- Adjustment of confidence intervals in ANOVA can be done using an adjusted confidence level of 1 - α/k, where k is the number of comparisons and α is the desired overall error rate. For example, with k = 10 and α = 0.05, the adjusted CI corresponds to a 1 - 0.05/10 = 0.995 confidence level.
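The F-ratio logic above can be sketched in Python (the notes use R below; this is an equivalent one-way ANOVA via SciPy). The three groups are made-up data for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (data are made up for illustration)
rng = np.random.default_rng(42)
group_a = rng.normal(10.0, 1.0, 30)
group_b = rng.normal(10.5, 1.0, 30)
group_c = rng.normal(12.0, 1.0, 30)

# One-way ANOVA: F is the ratio of between-group to within-group mean squares
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

Because the simulated group means differ by more than a within-group standard deviation, F comes out well above 1 and the p-value is small; with near-identical group means, F would hover around 1.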
How to do an ANOVA in R (conceptual steps)
- Load data and fit a model: modelName <- aov(responseColumn ~ explanatoryColumn, data = tableName)
- View the ANOVA table (sums of squares, mean squares, F, and p-value): summary(modelName)
- Get adjusted pairwise differences: modelResults <- TukeyHSD(modelName), then view with print(modelResults) or plot(modelResults)
- Note: TukeyHSD() requires a model fitted with aov(); if you fit with lm() instead, anova(modelName) returns the sums of squares and mean squares.
Interpreting ANOVA outputs
- The tables show the mean differences between groups and the confidence intervals adjusted for multiple comparisons. If an adjusted confidence interval excludes zero, the corresponding pair of groups differs significantly.
🔎 Confidence Intervals & Statistical Principles
The goal of statistical inference is to estimate population parameters and quantify our certainty about them. For a population mean, we use the sample mean and its sampling distribution to form a confidence interval (CI).
- A 95% CI means that, if you repeated the sampling many times, about 95% of intervals constructed this way would contain the true population mean. In practice, there is a 5% chance a given interval misses the true mean due to sampling error.
- Key principles for normally distributed sample means:
- About 68% of sample means fall within 1 standard error (SE) of the population mean.
- About 95% of sample means fall within 1.96 SEs of the population mean.
- The confidence interval is the range of values that contains the specified proportion of the population of sample means.
- The standard error is SE = s / √n, where s is the sample standard deviation and n is the sample size.
How to calculate a CI
- Calculate the sample mean x̄ and standard deviation s.
- Compute the standard error SE = s / √n.
- Determine the appropriate conversion factor for your CI level (e.g., 1.96 for a 95% CI with large n).
- Compute: Upper = x̄ + factor × SE, Lower = x̄ - factor × SE.
- For small samples (or non-normal data), use the t-distribution: the factor is the critical t value with n - 1 degrees of freedom, obtainable via qt(...) in R (e.g., qt(0.975, df = n - 1) for a 95% CI).
Example
You weigh 100 deer mice and find a mean weight of 11.3 g with standard deviation 1.2 g. Then SE = 1.2 / √100 = 0.12 g. Using a 95% CI with n = 100, the factor is 1.96. So the 95% CI is 11.3 ± 1.96 × 0.12, i.e. from approximately 11.06 g to 11.54 g. You can say with 95% confidence that the true mean lies in this interval.
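The deer-mouse calculation can be checked directly in Python, including the small-sample t alternative mentioned above:

```python
import math
from scipy import stats

# Deer mouse example from the text: n = 100, mean = 11.3 g, sd = 1.2 g
n, mean, sd = 100, 11.3, 1.2

se = sd / math.sqrt(n)                   # standard error = 0.12 g
z = 1.96                                 # large-sample 95% factor
lower, upper = mean - z * se, mean + z * se
print(round(lower, 2), round(upper, 2))  # 11.06 11.54

# For small samples, swap the z factor for a t critical value
# (equivalent to qt(0.975, df = n - 1) in R):
t = stats.t.ppf(0.975, df=n - 1)
print(round(t, 3))
```

With n = 100 the t factor (about 1.98) barely differs from 1.96, which is why the large-sample approximation is acceptable here.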
🧠 Shapiro-Wilk Normality Test
Normality is a key assumption for many tests. The Shapiro-Wilk test yields a W statistic and a p-value. A small p-value (p ≤ 0.05) suggests the data deviate from normality, while p > 0.05 indicates the data can be treated as normally distributed for practical purposes.
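A quick Python sketch of the test, contrasting a sample drawn from a normal distribution with a right-skewed (exponential) one; both samples are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(0, 1, 200)       # drawn from a normal distribution
skewed_sample = rng.exponential(1.0, 200)   # right-skewed, clearly non-normal

w1, p1 = stats.shapiro(normal_sample)
w2, p2 = stats.shapiro(skewed_sample)
print(f"normal sample: W = {w1:.3f}, p = {p1:.3f}")
print(f"skewed sample: W = {w2:.3f}, p = {p2:.3g}")
```

For the skewed sample, W drops noticeably below 1 and the p-value is tiny, so normality is rejected; the normal sample typically produces a W close to 1 and a large p-value.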
🔗 Correlations
Correlations measure the degree of association between two numerical variables and answer whether, as one variable changes, the other tends to change in the same direction.
- The correlation coefficient ranges from -1 to +1. A value near 0 indicates little to no linear association, while values near -1 or +1 indicate strong linear association. A positive coefficient means variables move in the same direction; a negative one means they move in opposite directions.
- Two main types:
- Pearson’s correlation coefficient is used when both variables are normally distributed.
- Spearman’s correlation coefficient is used when at least one variable is not normally distributed; it ranks the data before computing the correlation.
- Calculation in R: cor(x, y, method = "pearson") or cor(x, y, method = "spearman"), depending on normality.
- Important caveat: correlation does not convey the magnitude of change (slope) or causation; it only reflects association.
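Alongside R's cor(), the same two coefficients can be computed in Python with SciPy; the paired measurements here are made up for illustration (y roughly doubles x):

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

pearson_r, _ = stats.pearsonr(x, y)    # assumes both variables ~ normal
spearman_r, _ = stats.spearmanr(x, y)  # rank-based, no normality assumed
print(round(pearson_r, 3), round(spearman_r, 3))
```

Because y increases monotonically with x, Spearman's rank correlation is 1; Pearson's is also near 1 since the relationship is close to linear. Neither value tells you the slope, only the strength of association.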
📈 Distributions Overview
Understanding the distribution of each variable helps determine appropriate analyses. There are six common distribution types:
- Normal: Peak at the center, symmetric, unimodal.
- Right Skewed: Peak at the low end with a tail to the right, unimodal.
- Left Skewed: Peak at the high end with a tail to the left, unimodal.
- Bimodal: Two peaks, suggesting structure or subgroups.
- Uniform: Data spread evenly with no peak.
- Multimodal: More than two peaks, indicating complex structure.
Practical Notes
- The distribution informs the choice of tests and transformations. Normal distributions support many parametric tests, while skewed or multimodal data may require non-parametric methods or data transformation.
- Always plot and summarize distributions before formal testing to detect outliers and deviations from assumptions.
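As a quick numerical companion to plotting, a skewness statistic summarizes asymmetry: roughly 0 for symmetric data and clearly positive for a right tail. This sketch uses simulated data for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
symmetric = rng.normal(0, 1, 1000)          # should look normal / symmetric
right_skewed = rng.exponential(1.0, 1000)   # peak at the low end, right tail

skew_sym = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)
print(round(skew_sym, 2), round(skew_right, 2))
```

A skewness near 0 supports parametric methods; a large positive (or negative) value signals that a transformation or a non-parametric method may be more appropriate.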