
Quantitative Data Analysis and Visualisation — Study Notes

These study notes provide a concise summary of Quantitative Data Analysis and Visualisation, covering key concepts, definitions, and worked examples to help you review quickly and study effectively.


📊 Descriptive statistics — key summaries

Mean, median, mode, variance, standard deviation summarise central tendency and spread. The mean is the arithmetic average; the median is the middle value; the mode is the most frequent value. The variance measures average squared deviation; the standard deviation is its square root and has the same units as the data.

Example (library visits, n = 23): mean = 9.826, median = 11, mode = 11, standard deviation = 2.424349, variance = 5.87747, min = 3, max = 13, range = 10, 1st quartile (Q1) = 8, 3rd quartile (Q3) = 11.5.
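These summaries can be computed with the standard library alone. The raw library-visit data is not given in the notes, so this sketch uses a small hypothetical sample:

```python
import statistics

# Hypothetical sample (the raw library-visit data is not given in the notes)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value
var = statistics.pvariance(data)  # average squared deviation (population form)
sd = statistics.pstdev(data)      # square root of the variance, same units as data

print(mean, median, mode, var, sd)
```

Note that `pvariance`/`pstdev` divide by n; use `variance`/`stdev` for the sample (n − 1) versions.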

📈 Visualisation — histogram vs boxplot

Histogram: shows the distribution shape and preserves the original values (useful for seeing multimodality and counts per bin).

Boxplot: summarises median, quartiles and whiskers; highlights outliers and makes group comparisons easy. It does not show the detailed shape inside bins.

Practical tip: use a histogram to inspect the detailed shape and a boxplot to summarise and compare groups.
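The tip above can be sketched as a side-by-side plot, assuming matplotlib is available (the data values are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

visits = [3, 8, 8, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13]  # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(visits, bins=6)   # distribution shape, counts per bin
ax1.set_title("Histogram")
ax2.boxplot(visits)        # median, quartiles, whiskers, outliers
ax2.set_title("Boxplot")
fig.savefig("visits.png")
```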

🔍 Skewness and interpretation

Skewness describes asymmetry of the distribution. A left-skewed (negative skew) distribution has a longer left tail and median closer to the upper quartile. For the library visits example the distribution is left-skewed (longer lower whisker and a lower outlier at 3).
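One common way to quantify this is the moment-based skewness coefficient (mean cubed deviation over the cubed standard deviation); a negative value indicates left skew. A minimal sketch on a hypothetical left-skewed sample with a low outlier:

```python
import math

# Hypothetical left-skewed sample: long lower tail, low outlier at 3
x = [3, 8, 9, 10, 11, 11, 12]
n = len(x)
m = sum(x) / n
sd = math.sqrt(sum((v - m) ** 2 for v in x) / n)

# Moment-based skewness: mean cubed deviation divided by sd^3
skew = sum((v - m) ** 3 for v in x) / n / sd ** 3
print(skew)  # negative -> left-skewed
```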

🧪 Hypothesis testing — two-sample comparison of means (large samples)

State null and alternative clearly. Example for comparing weekly grocery spending between young professionals (group 1) and retired individuals (group 2):

  • H0: $\mu_1 - \mu_2 = 0$ (no difference)
  • H1: $\mu_1 - \mu_2 > 0$ (young professionals spend more) — one-sided test in this scenario.

When sample sizes are large, a z-test can be used with sample standard deviations as estimates. The test statistic is

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$.

Example numbers: $n_1 = n_2 = 40$, $\bar{x}_1 = 85$, $\bar{x}_2 = 78$, $S_1 = 12$, $S_2 = 10$. Calculated $z \approx 2.834$, one-sided p-value ≈ 0.0023. Conclusion: at $\alpha = 0.05$ reject H0; there is statistically significant evidence that young professionals spend more on groceries on average.
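The calculation above can be reproduced with the standard library (the one-sided p-value comes from the normal tail via `math.erfc`):

```python
import math

# Two-sample z-test with the numbers from the notes
n1 = n2 = 40
xbar1, xbar2 = 85, 78
s1, s2 = 12, 10

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference in means
z = (xbar1 - xbar2) / se                 # under H0: mu1 - mu2 = 0

# One-sided p-value P(Z > z) from the standard normal distribution
p = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 3), round(p, 4))
```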

📐 Contingency tables and the Chi-Squared test of independence

Use a Chi-Squared test of independence to check association between two categorical variables. For a contingency table with observed counts $O_{ij}$ and expected counts $E_{ij}$ (under independence), compute

$\chi^2 = \sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$.

Degrees of freedom for an $r \times c$ table: $(r-1)(c-1)$.

Example (strawberry spoilage after 5 days): Observed counts: cardboard spoiled 100 / not spoiled 119; plastic spoiled 200 / not spoiled 153. Row totals: 219 and 353; column totals: 300 and 272; grand total 572. Expected counts computed by $(\text{row total} \times \text{column total})/572$. Calculated $\chi^2 \approx 6.5512$ with 1 degree of freedom; p-value ≈ 0.01. Conclusion: reject independence at the 5% level — packaging material is associated with spoilage.
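The expected counts and the statistic can be computed directly from the observed table:

```python
# Chi-squared test of independence for the strawberry-spoilage table
observed = [[100, 119],   # cardboard: spoiled, not spoiled
            [200, 153]]   # plastic:   spoiled, not spoiled

row_totals = [sum(row) for row in observed]        # [219, 353]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 272]
grand = sum(row_totals)                            # 572

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed[i][j] - expected) ** 2 / expected

df = (2 - 1) * (2 - 1)  # (r-1)(c-1) for a 2x2 table
print(round(chi2, 4), df)
```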

📉 Multiple linear regression — interpretation of R output

The fitted model from R gives coefficient estimates, standard errors, t-values and p-values. The regression equation for the expected response is written from the estimates. Example from advertising data:

$E(\text{sales}) = 2.938889 + 0.045765\cdot\text{TV} + 0.188530\cdot\text{radio} - 0.001037\cdot\text{newspaper}$.

Interpretation:

  • Each additional unit of TV budget (1 thousand dollars) is associated with an average increase of about 0.0458 thousand units in sales, holding other predictors constant.
  • Radio has a similarly positive effect (≈ 0.1885 per thousand).
  • Newspaper coefficient is very small and not statistically significant (high p-value) — no evidence of an effect.

Model quality measures:

  • Residual standard error measures typical size of residuals.
  • Multiple R-squared (here 0.8972) indicates proportion of variance explained by the predictors. Adjusted R-squared penalises extra variables.

Model selection: prefer simpler models when they explain as much variance. If adding newspaper does not increase R-squared meaningfully and its coefficient is not significant, prefer the model with TV and radio only.

Prediction and residuals:

  • Predicted sales at TV = 10, radio = 10, newspaper = 10 (units in thousands): plug the values into the equation to get a prediction ≈ 5.27 (thousand units).
  • Residual = observed − predicted. If observed sales = 10 (thousand), residual ≈ 10 − 5.27 = 4.73 (thousand units).
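Plugging in the values is a one-liner once the fitted coefficients from the notes are stored:

```python
# Prediction from the fitted advertising model (coefficients from the notes)
b0, b_tv, b_radio, b_news = 2.938889, 0.045765, 0.188530, -0.001037

def predict(tv, radio, newspaper):
    """Expected sales (thousand units) for budgets in thousand dollars."""
    return b0 + b_tv * tv + b_radio * radio + b_news * newspaper

pred = predict(10, 10, 10)
residual = 10 - pred  # observed sales = 10 (thousand units)
print(round(pred, 2), round(residual, 2))
```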

📌 Practical decision rules and significance

  • For a one-sided z-test at $\alpha = 0.05$, critical value $z_{0.95} \approx 1.645$. Reject H0 if $z > 1.645$.
  • For Chi-Squared with 1 df and $\alpha = 0.05$ the critical value ≈ 3.841. Reject H0 if $\chi^2 > 3.841$.
  • Always report the test statistic, degrees of freedom, p-value and conclusion in context.
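Both critical values quoted above can be recovered from the standard library: the one-sided z cut-off is the 0.95 normal quantile, and the chi-squared(1 df) cut-off is the square of the 0.975 normal quantile (since a chi-squared variable with 1 df is a squared standard normal):

```python
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.95)           # one-sided z-test, alpha = 0.05
chi2_crit = NormalDist().inv_cdf(0.975) ** 2  # chi-squared(1 df) = Z^2

print(round(z_crit, 3), round(chi2_crit, 3))
```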

🧾 Useful formulas (compact)

  • Standard error of the mean: $\text{SE}(\bar{X}) = \dfrac{S}{\sqrt{n}}$.
  • SE of difference between two means: $\text{SE}(\bar{X}_1-\bar{X}_2)=\sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}}$.
  • $100(1-\alpha)$% CI for a large-sample mean: $\bar{X} \pm z_{\alpha/2}\cdot \dfrac{S}{\sqrt{n}}$.
  • Chi-squared statistic: $\chi^2 = \sum_k \dfrac{(O_k - E_k)^2}{E_k}$.
  • Logistic regression (log-odds): $\log\left(\dfrac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.
  • Poisson regression with offset: $\log(E(Y)) = \log(t) + \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.
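As a worked use of the CI formula above, here is a minimal sketch with made-up summary numbers (n, mean, and sd are hypothetical):

```python
import math
from statistics import NormalDist

# 95% CI for a large-sample mean; summary numbers are hypothetical
n, xbar, s = 50, 20.4, 3.2

se = s / math.sqrt(n)            # SE(X-bar) = S / sqrt(n)
z = NormalDist().inv_cdf(0.975)  # z_{alpha/2} for alpha = 0.05
ci = (xbar - z * se, xbar + z * se)
print(tuple(round(v, 2) for v in ci))
```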

✅ Final tips for exams

Work clearly: state hypotheses, show formulae and substitutions, compute the test statistic and p-value or compare to critical values, and give a plain-language conclusion about the real-world question. Always check assumptions (sample size for Normal approximations, expected counts for Chi-Squared, linearity and residual behaviour for regression).
