
Quantitative Data Analysis and Visualisation — Comprehensive Study Notes

These study notes provide a concise summary of quantitative data analysis and visualisation, covering key concepts, definitions, and worked examples for quick and effective review.


📊 Descriptive statistics — key concepts

Descriptive statistics summarise numerical data using measures of location and dispersion. Common summaries: mean, median, mode, variance, standard deviation, minimum, maximum, range, quartiles. Visual summaries include histograms and boxplots.

➗ Measures of central tendency and dispersion

  • Sample mean: $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$. This is the arithmetic average and is sensitive to outliers.
  • Median: the middle value when data are ordered. For even $n$, average the two middle values.
  • Mode: the most frequent value(s).
  • Sample variance: $S^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2$. Use $n-1$ for an unbiased estimator of the population variance.
  • Sample standard deviation: $S=\sqrt{S^2}$.
  • Range: $\text{range}=\max-\min$. Gives the overall spread but is sensitive to extremes.
  • Quartiles (Q1, Q2, Q3) split the ordered data into four equal parts; Q2 is the median. These are used in boxplots.
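These summaries can all be computed with Python's standard library; a minimal sketch, using an illustrative made-up sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 4, 8, 7]  # illustrative sample

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequent value
var = statistics.variance(data)   # sample variance (n-1 denominator)
sd = statistics.stdev(data)       # sample standard deviation
rng = max(data) - min(data)       # range = max - min

# Quartiles Q1, Q2, Q3 (default 'exclusive' method; Q2 equals the median)
q1, q2, q3 = statistics.quantiles(data, n=4)

print(mean, median, mode, var, rng, (q1, q2, q3))
```

Note that `statistics.variance` uses the $n-1$ denominator from the formula above; `statistics.pvariance` would give the population version with denominator $n$.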

📈 Boxplots and histograms — interpretation

  • Histogram: shows the distribution shape and preserves the original data bins. Useful for seeing modality and approximate distribution form.
  • Boxplot: highlights median, quartiles, whiskers (approximate spread), and outliers. Good for comparisons across groups.
  • Skewness from plots: A longer left whisker or long left tail indicates left-skew, a longer right whisker/right tail indicates right-skew, and symmetric whiskers indicate approximate symmetry.
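Boxplot whiskers and outlier points are conventionally based on the 1.5×IQR (Tukey fence) rule; a minimal sketch with illustrative data containing one extreme value:

```python
import statistics

data = [1, 2, 3, 4, 5, 100]  # illustrative data with one extreme value

# Quartiles via the 'inclusive' method (interpolates between observed values)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print((q1, q2, q3), outliers)
```

This is the same rule most plotting libraries use to decide where the whiskers end and which points to draw individually.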

🧪 Hypothesis testing for difference of two means (large samples)

When comparing two independent groups with large sample sizes ($n\ge 30$), a z-test for the difference of means can be used (using the sample standard deviations as estimates of the population standard deviations).

  • Null hypothesis: $H_0:\ \mu_1-\mu_2=0$ (no difference).
  • Alternative (one-sided or two-sided) depending on the research question (e.g. $H_1:\ \mu_1-\mu_2>0$ if expecting group 1 to be larger).
  • Test statistic: $z=\dfrac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{S_1^2/n_1+S_2^2/n_2}}$. Under $H_0$, take $\mu_1-\mu_2=0$.
  • Decision: compare $z$ to the critical value $z_{\alpha}$ (one-sided) or use the p-value from the standard normal distribution.

Example (numbers from a practice problem): $n_1=n_2=40$, $\bar{x}_1=85$, $\bar{x}_2=78$, $S_1=12$, $S_2=10$. Compute $z=\frac{85-78}{\sqrt{12^2/40+10^2/40}}\approx 2.834$. For a one-sided test at $\alpha=0.05$, the critical value is $z\approx 1.645$, so reject $H_0$ and conclude that group 1 has the larger mean.
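The worked example above can be checked in Python with the standard library (`NormalDist` plays the role of R's `qnorm`/`pnorm`; the numbers are those from the practice problem):

```python
import math
from statistics import NormalDist

# Summary statistics from the worked example
n1 = n2 = 40
xbar1, xbar2 = 85, 78
s1, s2 = 12, 10

# z statistic for the difference of two means (large samples)
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
z = (xbar1 - xbar2) / se

# One-sided critical value and p-value from the standard normal
z_crit = NormalDist().inv_cdf(0.95)  # approx 1.645
p_value = 1 - NormalDist().cdf(z)

print(round(z, 3), z > z_crit)  # reject H0 when z exceeds the critical value
```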

⚖️ Interpreting test results

  • If $|z|>z_{\alpha/2}$ (two-sided) or $z>z_{\alpha}$ (one-sided), reject $H_0$ at level $\alpha$.
  • Report conclusion in context (e.g. "statistically significant evidence that young professionals spend more on groceries than retired individuals at the 5% level").

🔢 Chi-squared test for independence (contingency tables)

Use this to test association between two categorical variables (e.g. packaging type and spoilage).

  • Observed table: cells $O_{ij}$, with row totals and column totals.
  • Expected counts under independence: $E_{ij}=\frac{(\text{row total}_i)(\text{col total}_j)}{n}$.
  • Chi-squared statistic: $\chi^2=\sum_{i}\sum_{j} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$.
  • Degrees of freedom: $(r-1)(c-1)$ for an $r\times c$ table.
  • Decision: compare $\chi^2$ to the critical value from $\chi^2_{\nu}$ or use the p-value.

Example (numbers from the exam): Observed: Cardboard spoiled 100, not spoiled 119 (row total 219); Plastic spoiled 200, not spoiled 153 (row total 353); column totals: spoiled 300, not spoiled 272; total $n=572$. Expected example: $E(\text{Cardboard, spoiled})=\frac{219\times 300}{572}\approx 114.86$. Calculated $\chi^2\approx 6.55$ with $df=1$. The critical value at $\alpha=0.05$ is $\approx 3.841$, so reject independence and conclude there is an association between packaging and spoilage.
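The expected counts and the chi-squared statistic for this table can be computed from scratch in a few lines; a minimal sketch using the exam numbers:

```python
# Observed counts: rows = packaging (cardboard, plastic), cols = spoiled / not spoiled
observed = [[100, 119],
            [200, 153]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, accumulated into the chi-squared statistic
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # (row total)(col total)/n
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)  # compare chi2 to the critical value 3.841 (df=1, alpha=0.05)
```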

📉 Multiple linear regression — interpretation of R output

  • Model form: $E(Y)=\beta_0+\beta_1 X_1+\beta_2 X_2+\dots$
  • Coefficient interpretation: each coefficient is the expected change in $Y$ for a one-unit increase in that predictor, holding the others constant.
  • From the example R output: estimated model $E(\text{sales})=2.938889+0.045765(\text{TV})+0.188530(\text{radio})-0.001037(\text{newspaper})$.
  • Significance: use the t-statistics and p-values for each coefficient. Very small p-values (e.g. < 0.001) indicate strong evidence that the coefficient differs from zero.
  • Prediction: plug predictor values into the fitted equation for a point prediction. Example: for TV=10, radio=10, newspaper=10, $E(\text{sales})=2.938889+0.045765\times10+0.188530\times10-0.001037\times10\approx 5.271469$ (thousand units).
  • Residual: observed minus predicted. If observed sales = 10 (thousand), residual $=10-5.271469\approx 4.728531$ (thousand).
  • R-squared: proportion of variance explained by the model. A higher $R^2$ indicates a better fit, but beware of overfitting. Prefer simpler models if they explain similar variance (parsimony).
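The point prediction and residual from the fitted equation can be reproduced directly (the coefficients are those quoted from the R output; `predict_sales` is a hypothetical helper name used for illustration):

```python
# Fitted coefficients quoted from the R output above
b0, b_tv, b_radio, b_news = 2.938889, 0.045765, 0.188530, -0.001037

def predict_sales(tv, radio, newspaper):
    """Point prediction from the fitted multiple regression equation."""
    return b0 + b_tv * tv + b_radio * radio + b_news * newspaper

pred = predict_sales(10, 10, 10)  # each advertising budget set to 10 (thousand)
residual = 10 - pred              # observed sales of 10 (thousand) minus predicted

print(round(pred, 6), round(residual, 6))
```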

Model selection example: $R^2$ for TV+radio+newspaper = 0.8972 and for TV+radio = 0.8972. The additional predictor (newspaper) gives no improvement in $R^2$, so prefer the simpler TV+radio model.

🧮 Degrees of freedom and sample size from regression output

  • In the linear model output, residual degrees of freedom $=n-k$, where $k$ is the number of parameters (including the intercept). So $n=\text{residual df}+k$.
  • Example: residual df = 196, $k=4$ (intercept + 3 predictors), so $n=196+4=200$ regions.

🧾 Useful formulas (formula-sheet style)

  • Standard error of the sample mean: $\text{SE}(\bar{X})=\sqrt{S^2/n}$.
  • Standard error for the difference of two means: $\sqrt{S_1^2/n_1+S_2^2/n_2}$.
  • Large-sample $100(1-\alpha)\%$ CI for the mean: $\left(\bar{X}-z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X}+z_{\alpha/2}\frac{S}{\sqrt{n}}\right)$.
  • Hypothesis test for a proportion uses standard error $\sqrt{\pi_0(1-\pi_0)/n}$ under $H_0$.
  • Chi-squared statistic: $\chi^2=\sum\frac{(O_k-E_k)^2}{E_k}$.
  • Logistic regression form: $\log\left(\frac{P(Y=1)}{P(Y=0)}\right)=\beta_0+\beta_1X_1+\dots+\beta_pX_p$.
  • Poisson regression (with offset $t$): $\log(E(Y))=\log(t)+\beta_0+\beta_1X_1+\dots+\beta_pX_p$.
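As a check on the CI formula, a large-sample 95% interval for a mean can be computed with the standard library (the summary numbers are illustrative, reusing the earlier example's mean 85, s.d. 12, $n=40$):

```python
import math
from statistics import NormalDist

# Illustrative summary statistics: sample mean 85, sample s.d. 12, n = 40
xbar, s, n = 85, 12, 40
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)  # approx 1.959964, same as qnorm(0.975) in R
se = s / math.sqrt(n)                    # standard error of the sample mean
ci = (xbar - z * se, xbar + z * se)      # large-sample 95% confidence interval

print(round(ci[0], 3), round(ci[1], 3))
```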

🔎 Reference z / t values (common quantiles)

  • One-sided $\alpha=0.05$ critical value: $z\approx 1.645$.
  • Two-sided $\alpha=0.05$ critical value: $z_{0.975}\approx 1.959964$.
  • Useful R-derived values: qnorm(0.95) $\approx 1.644854$, qnorm(0.975) $\approx 1.959964$.

✅ Practical tips for exam-style problems

  • Always state hypotheses in context and specify whether the test is one-sided or two-sided.
  • Show formula, substitute numbers, compute the test statistic, state critical value or p-value, then give a contextual conclusion.
  • For contingency tables, check expected counts (all should be reasonably large for the Chi-squared approximation to hold).
  • For regression interpretation, comment on significance (p-values), the sign and magnitude of the coefficients, goodness-of-fit ($R^2$), and practical uncertainty (residuals and prediction intervals, not just point predictions).

These notes summarise the core tools used in the sample exam: descriptive summaries and plots, z-tests for mean differences, Chi-squared tests for independence, and interpretation of multiple linear regression output.
