Quantitative Data Analysis and Visualisation: Study Notes
These study notes give a concise summary of Quantitative Data Analysis and Visualisation, covering key concepts, definitions, and examples for quick review.
📊 Descriptive statistics — key summaries
The mean, median and mode summarise central tendency; the variance and standard deviation summarise spread. The mean is the arithmetic average; the median is the middle value of the sorted data; the mode is the most frequent value. The variance is the average squared deviation from the mean; the standard deviation is its square root and has the same units as the data.
Example (library visits, n = 23): mean ≈ …, median = …, mode = …, standard deviation ≈ …, variance ≈ …, min = …, max = …, range = …, 1st quartile (Q1) = …, 3rd quartile (Q3) = ….
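A minimal sketch of how these summaries can be computed with Python's standard `statistics` module. The `visits` list is hypothetical illustration data, not the n = 23 data set from the example, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical visit counts (NOT the n = 23 data set from the notes).
visits = [3, 8, 9, 10, 10, 11, 12, 12, 12, 13, 14]

mean = statistics.mean(visits)            # arithmetic average
median = statistics.median(visits)        # middle value of the sorted data
mode = statistics.mode(visits)            # most frequent value
variance = statistics.variance(visits)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(visits)        # square root of the sample variance
data_range = max(visits) - min(visits)

# Quartiles: the "inclusive" method interpolates between the observed values.
q1, q2, q3 = statistics.quantiles(visits, n=4, method="inclusive")

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} sd={std_dev:.2f} range={data_range}")
print(f"Q1={q1} Q3={q3}")
```

In this made-up sample the mean lies below the median, which is the usual pattern when a low value pulls the average down.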
📈 Visualisation — histogram vs boxplot
Histogram: shows the shape of the distribution and the count in each bin, which makes features such as multimodality visible. Individual values are grouped into bins, so the exact values are not shown.
Boxplot: summarises the median, quartiles and whiskers; highlights outliers and makes group comparisons easy. It does not show the detailed shape of the distribution (for example, more than one peak).
Practical tip: use a histogram to inspect the detailed shape and a boxplot to summarise and compare groups.
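The two views can be compared on the same numbers with a short sketch. The data are hypothetical, the bin width of 2 is an arbitrary choice, and the 1.5 × IQR rule for flagging outliers is a common boxplot convention assumed here rather than something stated in these notes.

```python
import statistics
from collections import Counter

# Hypothetical data (not from the notes).
values = [3, 8, 9, 10, 10, 11, 12, 12, 12, 13, 14]

# Histogram view: count how many values fall in each bin of width 2.
bin_width = 2
bin_counts = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(bin_counts):
    print(f"[{start}, {start + bin_width}): {'#' * bin_counts[start]}")

# Boxplot view: median and quartiles, plus points beyond the 1.5 * IQR fences.
q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(f"Q1={q1} median={med} Q3={q3} outliers={outliers}")
```

The text histogram shows where the counts pile up, while the five numbers and the outlier list are all a boxplot would draw.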
🔍 Skewness and interpretation
Skewness describes asymmetry of the distribution. A left-skewed (negative skew) distribution has a longer left tail and median closer to the upper quartile. For the library visits example the distribution is left-skewed (longer lower whisker and a lower outlier at 3).
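As a rough numerical check on what a plot suggests, a moment-based skewness coefficient can be computed; a negative value points to left skew. This is a sketch on hypothetical data, and the particular coefficient used (third central moment divided by the cubed population standard deviation) is one of several definitions in use.

```python
import statistics

def sample_skewness(data):
    """Moment-based skewness: mean cubed deviation / (population SD) ** 3."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical left-skewed sample with one low value.
left_skewed = [3, 8, 9, 10, 10, 11, 12, 12, 12, 13, 14]
skew = sample_skewness(left_skewed)
print(f"skewness = {skew:.2f}")   # negative for a longer left tail
```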
🧪 Hypothesis testing — two-sample comparison of means (large samples)
State null and alternative clearly. Example for comparing weekly grocery spending between young professionals (group 1) and retired individuals (group 2):
- H0: $\mu_1 = \mu_2$ (no difference)
- H1: $\mu_1 > \mu_2$ (young professionals spend more); this is a one-sided test in this scenario.
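When sample sizes are large, a $z$-test can be used, with the sample standard deviations $s_1$ and $s_2$ as estimates of the population standard deviations. The test statistic is
$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$.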
Example numbers: …, …, …, …, …. Calculated $z$ = …, one-sided p-value ≈ …. Conclusion: at $\alpha$ = …, reject H0; there is statistically significant evidence that young professionals spend more on groceries on average.
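The large-sample two-sample comparison can be sketched as below. The summary statistics passed to `two_sample_z` are hypothetical placeholders chosen for easy arithmetic, not the example numbers from these notes; the function name is also illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z statistic and one-sided p-value for H1: mu1 > mu2."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # SE of the difference in means
    z = (mean1 - mean2) / se
    p_one_sided = 1 - NormalDist().cdf(z)      # upper-tail probability
    return z, p_one_sided

# Hypothetical weekly grocery spending summaries (group 1 vs group 2).
z, p = two_sample_z(mean1=85.0, sd1=20.0, n1=100,
                    mean2=80.0, sd2=15.0, n2=100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With these made-up inputs the standard error is 2.5, so a difference of 5 gives z = 2.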
📐 Contingency tables and the Chi-Squared test of independence
Use a Chi-Squared test of independence to check association between two categorical variables. For a contingency table with observed counts $O_{ij}$ and expected counts $E_{ij}$ (under independence), compute
$\chi^2 = \sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$.
Degrees of freedom for an $r \times c$ table: $(r - 1)(c - 1)$.
Example (strawberry spoilage after 5 days): Observed counts: cardboard spoiled 100 / not spoiled 119; plastic spoiled 200 / not spoiled 153. Row totals: 219 and 353; column totals: 300 and 272; grand total 572. Expected counts computed by $E_{ij} = (\text{row } i \text{ total} \times \text{column } j \text{ total}) / N$, where $N$ is the grand total. Calculated $\chi^2 \approx 6.55$ (from the counts above, without continuity correction) with 1 degree of freedom; p-value ≈ 0.01. Conclusion: reject independence at the 5% level; packaging material is associated with spoilage.
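The strawberry calculation can be reproduced with a few lines of standard-library Python. This sketch uses the observed counts quoted above and the uncorrected Pearson statistic; the p-value shortcut via `erfc` is valid only for 1 degree of freedom, where the Chi-Squared variable is the square of a standard Normal.

```python
from math import erfc, sqrt

# Observed counts from the example: rows = packaging, columns = outcome.
observed = [[100, 119],   # cardboard: spoiled, not spoiled
            [200, 153]]   # plastic:   spoiled, not spoiled

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

For tables larger than 2 × 2 the `erfc` shortcut no longer applies and a Chi-Squared table or a statistics library would be needed for the p-value.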
📉 Multiple linear regression — interpretation of R output
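The fitted model from R gives coefficient estimates, standard errors, $t$-values and p-values. The regression equation for the expected response is written from the estimates. Example from advertising data:
$\widehat{\text{sales}} = \hat\beta_0 + \hat\beta_1\,\text{TV} + \hat\beta_2\,\text{radio} + \hat\beta_3\,\text{newspaper}$, with the estimates $\hat\beta_0, \dots, \hat\beta_3$ read from the R output.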
Interpretation:
- Each additional unit of TV budget (1 thousand dollars) is associated with an average increase of about … thousand units in sales, holding other predictors constant.
- Radio has a similarly positive effect (≈ … per thousand).
- Newspaper coefficient is very small and not statistically significant (high p-value) — no evidence of an effect.
Model quality measures:
- Residual standard error measures typical size of residuals.
- Multiple R-squared (here 0.8972) is the proportion of variance in the response explained by the predictors. Adjusted R-squared penalises extra variables.
Model selection: prefer simpler models when they explain as much variance. R-squared never decreases when a predictor is added, so a small rise is not evidence on its own; compare adjusted R-squared and the coefficient's p-value. If adding newspaper does not increase R-squared meaningfully and its coefficient is not significant, prefer the model with TV and radio only.
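The penalty applied by adjusted R-squared can be made concrete with the standard formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$. In this sketch the R-squared value 0.8972 is the one quoted above, but the sample size `n_obs = 200` is a hypothetical assumption, since these notes do not state how many observations the model was fitted on.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

r2_full = 0.8972   # multiple R-squared quoted in the notes (TV, radio, newspaper)
n_obs = 200        # hypothetical sample size; not given in the notes

adj_full = adjusted_r_squared(r2_full, n_obs, n_predictors=3)
print(f"adjusted R-squared = {adj_full:.4f}")
```

The adjusted value is always at most the unadjusted one, and the gap widens as predictors are added without a matching gain in fit.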
Prediction and residuals:
- Predicted sales at TV = 10, radio = 10, newspaper = 10 (units in thousands): plug the values into the equation to get a prediction ≈ … (thousand units).
- Residual = observed − predicted. If observed sales = 10 (thousand), residual ≈ … (thousand units).
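The plug-in step and the residual can be sketched as follows. The coefficient values in `coef` are hypothetical round numbers chosen for illustration; they are not the estimates from the R output, so the printed prediction is not the answer to the example above.

```python
# Hypothetical coefficient estimates, for illustration only
# (NOT the values from the R output in the notes).
coef = {"intercept": 3.0, "TV": 0.05, "radio": 0.2, "newspaper": 0.0}

def predict_sales(tv, radio, newspaper):
    """Expected sales (thousand units) from budgets in thousand dollars."""
    return (coef["intercept"]
            + coef["TV"] * tv
            + coef["radio"] * radio
            + coef["newspaper"] * newspaper)

predicted = predict_sales(tv=10, radio=10, newspaper=10)
observed = 10                      # observed sales, thousand units
residual = observed - predicted    # positive: the model under-predicts

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```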
📌 Practical decision rules and significance
- For a one-sided $z$-test at $\alpha = 0.05$, critical value $z_{0.05} \approx 1.645$. Reject H0 if $z > 1.645$.
- For Chi-Squared with 1 df and $\alpha = 0.05$, the critical value ≈ 3.841. Reject H0 if $\chi^2 > 3.841$.
- Always report the test statistic, degrees of freedom, p-value and conclusion in context.
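Critical values like these can be recovered from the Normal quantile function rather than memorised. The sketch below assumes a significance level of 0.05 and uses the fact that a Chi-Squared variable with 1 df is a squared standard Normal, so its upper critical value is the square of the two-sided z critical value.

```python
from statistics import NormalDist

alpha = 0.05   # assumed significance level

# One-sided z-test: upper-tail critical value.
z_crit = NormalDist().inv_cdf(1 - alpha)

# Chi-Squared with 1 df: square of the two-sided z critical value.
chi2_crit_1df = NormalDist().inv_cdf(1 - alpha / 2) ** 2

def reject_h0(statistic, critical_value):
    """Upper-tail decision rule: reject H0 when the statistic exceeds the critical value."""
    return statistic > critical_value

print(f"z critical = {z_crit:.3f}, chi2 (1 df) critical = {chi2_crit_1df:.3f}")
```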
🧾 Useful formulas (compact)
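- Standard error of the mean: $SE(\bar{x}) = s/\sqrt{n}$.
- SE of difference between two means: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$.
- $100(1-\alpha)$% CI for a large-sample mean: $\bar{x} \pm z_{\alpha/2}\, s/\sqrt{n}$.
- Chi-squared statistic: $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$.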
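- Logistic regression (log-odds): $\log\dfrac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.
- Poisson regression with offset: $\log \mu = \log t + \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$, where $t$ is the exposure and $\log t$ is the offset.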
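Three of these formulas translate directly into small helper functions. This is a sketch with illustrative function names and hypothetical inputs; the linear predictor is passed in as a single number, so no model fitting is shown.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def mean_ci(xbar, s, n, level=0.95):
    """Large-sample confidence interval for a mean: xbar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width

def logistic_probability(log_odds):
    """Invert the log-odds: p = 1 / (1 + exp(-log_odds))."""
    return 1 / (1 + exp(-log_odds))

def poisson_expected_count(exposure, linear_predictor):
    """Poisson model with offset: log(mu) = log(exposure) + linear predictor."""
    return exp(log(exposure) + linear_predictor)

# Hypothetical inputs for illustration.
low, high = mean_ci(xbar=50.0, s=10.0, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
print(f"p at log-odds 0: {logistic_probability(0.0)}")
print(f"expected count: {poisson_expected_count(2.0, 0.0):.2f}")
```

The offset enters with a fixed coefficient of 1, so doubling the exposure doubles the expected count for the same predictor values.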
✅ Final tips for exams
Work clearly: state hypotheses, show formulae and substitutions, compute the test statistic and p-value or compare to critical values, and give a plain-language conclusion about the real-world question. Always check assumptions (sample size for Normal approximations, expected counts for Chi-Squared, linearity and residual behaviour for regression).