Quantitative Data Analysis and Visualisation — Comprehensive Study Notes
📊 Descriptive statistics — key concepts
Descriptive statistics summarise numerical data using measures of location and dispersion. Common summaries: mean, median, mode, variance, standard deviation, minimum, maximum, range, quartiles. Visual summaries include histograms and boxplots.
➗ Measures of central tendency and dispersion
- Sample mean: x̄ = (1/n) Σ xᵢ. This is the arithmetic average and is sensitive to outliers.
- Median: the middle value when data are ordered. For even n, average the two middle values.
- Mode: the most frequent value(s).
- Sample variance: s² = Σ (xᵢ − x̄)² / (n − 1). Dividing by n − 1 gives an unbiased estimator of the population variance.
- Sample standard deviation: s = √s².
- Range: max − min. Gives overall spread but is sensitive to extremes.
- Quartiles (Q1, Q2, Q3) split the ordered data into four equal parts; Q2 is the median. These are used in boxplots.
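All of the summaries above are available in Python's standard `statistics` module; the data values below are made up purely for illustration:

```python
import statistics

# Hypothetical sample of weekly grocery spend (illustrative values only)
data = [42, 38, 51, 47, 38, 60, 45, 39, 55, 48]

mean = statistics.mean(data)        # arithmetic average, outlier-sensitive
median = statistics.median(data)    # middle value of the ordered data
mode = statistics.mode(data)        # most frequent value
var = statistics.variance(data)     # sample variance with the n - 1 divisor
sd = statistics.stdev(data)         # square root of the sample variance
rng = max(data) - min(data)         # range = max - min

# Quartiles; with n=4 cut points, q2 equals the median
q1, q2, q3 = statistics.quantiles(data, n=4)

print(mean, median, mode, rng)      # 46.3 46.0 38 22
```

Note that `statistics.variance` uses the n − 1 divisor by default; `statistics.pvariance` is the population version with divisor n.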
📈 Boxplots and histograms — interpretation
- Histogram: shows the distribution shape and preserves the original data bins. Useful for seeing modality and approximate distribution form.
- Boxplot: highlights median, quartiles, whiskers (approximate spread), and outliers. Good for comparisons across groups.
- Skewness from plots: A longer left whisker or long left tail indicates left-skew, a longer right whisker/right tail indicates right-skew, and symmetric whiskers indicate approximate symmetry.
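One numeric companion to the visual check: compare the quartile gaps. If Q3 − Q2 is clearly larger than Q2 − Q1, the upper middle half of the data is stretched out, which matches a longer right whisker. A sketch with made-up data (not a substitute for looking at the plot):

```python
import statistics

# Illustrative right-skewed sample: a few large values stretch the upper tail
data = [10, 11, 12, 12, 13, 14, 15, 18, 25, 40]

q1, q2, q3 = statistics.quantiles(data, n=4)
lower_gap = q2 - q1   # spread of the lower middle half
upper_gap = q3 - q2   # spread of the upper middle half

if upper_gap > lower_gap:
    print("suggests right skew")
elif upper_gap < lower_gap:
    print("suggests left skew")
else:
    print("roughly symmetric middle")
```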
🧪 Hypothesis testing for difference of two means (large samples)
When comparing two independent groups with large sample sizes (roughly n₁, n₂ ≥ 30), a z-test for the difference of means can be used (using the sample standard deviations as estimates of the population standard deviations).
- Null hypothesis: H₀: μ₁ = μ₂ (no difference).
- Alternative: one-sided or two-sided depending on the research question (e.g. H₁: μ₁ > μ₂ if expecting group 1 larger).
- Test statistic: z = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂). Under H₀ take μ₁ − μ₂ = 0.
- Decision: compare z to the critical value z_α (one-sided) or use the p-value from the standard normal.
Example (structure of a typical practice problem): substitute x̄₁, x̄₂, s₁, s₂, n₁ and n₂ into z = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂). For a one-sided test at α = 0.05 the critical value is z₀.₀₅ = 1.645; if the computed z exceeds 1.645, reject H₀ and conclude group 1 has the larger mean.
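The calculation can be scripted directly. The sample figures below are hypothetical, chosen only to show the mechanics:

```python
import math

def z_two_means(xbar1, xbar2, s1, s2, n1, n2):
    """Large-sample z statistic for H0: mu1 = mu2."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    return (xbar1 - xbar2) / se

# Hypothetical figures: group 1 averages 52, group 2 averages 48
z = z_two_means(xbar1=52, xbar2=48, s1=10, s2=12, n1=100, n2=90)

print(round(z, 2))   # ~2.48
print(z > 1.645)     # one-sided test at alpha = 0.05: True, so reject H0
```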
⚖️ Interpreting test results
- If |z| > z_{α/2} (two-sided) or z > z_α (one-sided), reject H₀ at level α.
- Report conclusion in context (e.g. "statistically significant evidence that young professionals spend more on groceries than retired individuals at the 5% level").
🔢 Chi-squared test for independence (contingency tables)
Use this to test association between two categorical variables (e.g. packaging type and spoilage).
- Observed table: cells Oᵢⱼ, row totals Rᵢ and column totals Cⱼ.
- Expected counts under independence: Eᵢⱼ = Rᵢ · Cⱼ / n.
- Chi-squared statistic: χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ.
- Degrees of freedom: (r − 1)(c − 1) for an r × c table.
- Decision: compare to the critical value from the χ² distribution with (r − 1)(c − 1) degrees of freedom, or use the p-value.
Example (numbers from the exam): Observed: Cardboard spoiled 100, not spoiled 119 (row total 219); Plastic spoiled 200, not spoiled 153 (row total 353); column totals: spoiled 300, not spoiled 272; total n = 572. Expected example: E₁₁ = 219 × 300 / 572 ≈ 114.86. Calculated χ² ≈ 6.55 with (2 − 1)(2 − 1) = 1 degree of freedom. The critical value at α = 0.05 is 3.841, so reject independence — conclude an association between packaging and spoilage.
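The table from this example can be checked in a few lines of plain Python — no SciPy needed when comparing against the tabulated 1-df critical value 3.841:

```python
# Observed 2x2 table from the worked example: rows = packaging, cols = spoilage
observed = [[100, 119],   # cardboard: spoiled, not spoiled
            [200, 153]]   # plastic:   spoiled, not spoiled

row_totals = [sum(row) for row in observed]        # 219, 353
col_totals = [sum(col) for col in zip(*observed)]  # 300, 272
n = sum(row_totals)                                # 572

# Expected counts under independence: E_ij = R_i * C_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

print(round(expected[0][0], 2))  # ~114.86
print(round(chi2, 2))            # ~6.55
print(chi2 > 3.841)              # df = 1 critical value: True, reject independence
```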
📉 Multiple linear regression — interpretation of R output
- Model form: y = β₀ + β₁x₁ + β₂x₂ + … + β_p x_p + ε.
- Coefficient interpretation: each coefficient is the expected change in for a one-unit increase in that predictor, holding others constant.
- From example R output: read the Estimate column to form the fitted model ŷ = b₀ + b₁·TV + b₂·radio + b₃·newspaper.
- Significance: use t-statistics and p-values for each coefficient. Very small p-values (e.g. < 0.001) indicate strong evidence the coefficient differs from zero.
- Prediction: plug predictor values into the fitted equation for a point prediction. Example: for TV = 10, radio = 10, newspaper = 10, ŷ = b₀ + 10b₁ + 10b₂ + 10b₃ (thousand units).
- Residual: observed minus predicted. If observed sales = 10 (thousand), residual = 10 − ŷ (thousand).
- R-squared: proportion of variance explained by the model. Higher R² indicates better fit, but beware overfitting. Prefer simpler models if they explain similar variance (parsimony).
Model selection example: R² for TV + radio + newspaper = 0.8972 and R² for TV + radio = 0.8972. The additional predictor (newspaper) gives no improvement in R², so prefer the simpler TV + radio model.
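Point prediction and residual from a fitted equation are simple arithmetic. The coefficients below are placeholders for illustration, not the actual values from the R output:

```python
def predict(coefs, x):
    """Point prediction b0 + b1*x1 + ... + bp*xp from fitted coefficients."""
    intercept, slopes = coefs[0], coefs[1:]
    return intercept + sum(b * xi for b, xi in zip(slopes, x))

# Placeholder coefficients (intercept, TV, radio, newspaper) -- illustrative only
coefs = [3.0, 0.05, 0.19, -0.001]

y_hat = predict(coefs, [10, 10, 10])  # TV=10, radio=10, newspaper=10
residual = 10 - y_hat                 # observed sales of 10 (thousand) minus fit

print(round(y_hat, 2))    # 5.39 with these placeholder coefficients
print(round(residual, 2)) # 4.61
```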
🧮 Degrees of freedom and sample size from regression output
- In the linear model output, residual degrees of freedom = n − p, where p is the number of parameters (including the intercept). So n = residual df + p.
- Example: residual df = 196, p = 4 (intercept + 3 predictors), so n = 196 + 4 = 200 regions.
🧾 Useful formulas (formula-sheet style)
- Standard error of the sample mean: s/√n.
- Standard error for the difference of two means: √(s₁²/n₁ + s₂²/n₂).
- Large-sample CI for the mean: x̄ ± z_{α/2} · s/√n.
- A hypothesis test for a proportion uses standard error √(p₀(1 − p₀)/n) under H₀: p = p₀.
- Chi-squared statistic: χ² = Σ (O − E)² / E.
- Logistic regression form: log(p / (1 − p)) = β₀ + β₁x₁ + … + β_p x_p.
- Poisson regression (with offset log t): log μ = log t + β₀ + β₁x₁ + … + β_p x_p.
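For the logistic form, a fitted model turns into a predicted probability via the inverse of the logit. The coefficients here are hypothetical, chosen only to show the transformation:

```python
import math

def logistic_prob(coefs, x):
    """P(Y = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ...))) from a logistic fit."""
    eta = coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))  # linear predictor
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients: intercept -1.5, one predictor with slope 0.3.
# At x = 5 the linear predictor is exactly 0, so the probability is 0.5.
print(logistic_prob([-1.5, 0.3], [5]))
```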
🔎 Reference z / t values (common quantiles)
- One-sided critical values: z₀.₀₅ = 1.645, z₀.₀₁ = 2.326.
- Two-sided critical values: z₀.₀₂₅ = 1.96, z₀.₀₀₅ = 2.576.
- Useful R-derived values: qnorm(0.95) ≈ 1.645, qnorm(0.975) ≈ 1.96.
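These standard-normal quantiles can also be reproduced without R via Python's `statistics.NormalDist`, a stand-in for `qnorm`:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z_one_sided = std_normal.inv_cdf(0.95)   # ~1.645, like qnorm(0.95) in R
z_two_sided = std_normal.inv_cdf(0.975)  # ~1.96,  like qnorm(0.975) in R

print(round(z_one_sided, 3), round(z_two_sided, 2))
```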
✅ Practical tips for exam-style problems
- Always state hypotheses in context and specify whether the test is one-sided or two-sided.
- Show formula, substitute numbers, compute the test statistic, state critical value or p-value, then give a contextual conclusion.
- For contingency tables, check the expected counts (the usual rule of thumb is that all expected counts should be at least 5 for the chi-squared approximation to hold).
- For regression interpretation, comment on significance (p-values), sign and magnitude of coefficients, goodness-of-fit (), and practical uncertainty (residuals, prediction intervals not just point predictions).
These notes summarise the core tools used in the sample exam: descriptive summaries and plots, z-tests for mean differences, Chi-squared tests for independence, and interpretation of multiple linear regression output.