
Quantitative Data Analysis and Visualisation — Study Notes

These study notes provide a concise summary of Quantitative Data Analysis and Visualisation, covering key concepts, definitions, and worked examples to help you review quickly and study effectively.


📊 Descriptive statistics — key summaries

Mean, median, mode, variance, standard deviation summarise central tendency and spread. The mean is the arithmetic average; the median is the middle value; the mode is the most frequent value. The variance measures average squared deviation; the standard deviation is its square root and has the same units as the data.

Example (library visits, n = 23): mean = 9.826, median = 11, mode = 11, standard deviation = 2.424349, variance = 5.87747, min = 3, max = 13, range = 10, 1st quartile (Q1) = 8, 3rd quartile (Q3) = 11.5.
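These summaries can be computed with the standard library alone. The raw library-visit data is not given in the notes, so this sketch uses a small hypothetical sample:

```python
import statistics

# Hypothetical sample (the raw library-visit data is not given in the notes)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value
var = statistics.pvariance(data)  # average squared deviation (population form)
sd = statistics.pstdev(data)      # square root of the variance, same units as data

print(mean, median, mode, var, sd)
```

Note that `pvariance`/`pstdev` divide by n; use `variance`/`stdev` for the sample (n − 1) versions.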

📈 Visualisation — histogram vs boxplot

Histogram: shows the distribution shape and preserves the original values (useful for seeing multimodality and counts per bin).

Boxplot: summarises median, quartiles and whiskers; highlights outliers and makes group comparisons easy. It does not show the detailed shape inside bins.

Practical tip: use a histogram to inspect the detailed shape and a boxplot to summarise and compare groups.
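The tip above can be sketched as a side-by-side plot, assuming matplotlib is available (the data values are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

visits = [3, 8, 8, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13]  # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(visits, bins=6)   # distribution shape, counts per bin
ax1.set_title("Histogram")
ax2.boxplot(visits)        # median, quartiles, whiskers, outliers
ax2.set_title("Boxplot")
fig.savefig("visits.png")
```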

🔍 Skewness and interpretation

Skewness describes asymmetry of the distribution. A left-skewed (negative skew) distribution has a longer left tail and median closer to the upper quartile. For the library visits example the distribution is left-skewed (longer lower whisker and a lower outlier at 3).
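One common way to quantify this is the moment-based skewness coefficient (mean cubed deviation over the cubed standard deviation); a negative value indicates left skew. A minimal sketch on a hypothetical left-skewed sample with a low outlier:

```python
import math

# Hypothetical left-skewed sample: long lower tail, low outlier at 3
x = [3, 8, 9, 10, 11, 11, 12]
n = len(x)
m = sum(x) / n
sd = math.sqrt(sum((v - m) ** 2 for v in x) / n)

# Moment-based skewness: mean cubed deviation divided by sd^3
skew = sum((v - m) ** 3 for v in x) / n / sd ** 3
print(skew)  # negative -> left-skewed
```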

🧪 Hypothesis testing — two-sample comparison of means (large samples)

State null and alternative clearly. Example for comparing weekly grocery spending between young professionals (group 1) and retired individuals (group 2):

  • H0: $\mu_1 - \mu_2 = 0$ (no difference)
  • H1: $\mu_1 - \mu_2 > 0$ (young professionals spend more) — one-sided test in this scenario.

When sample sizes are large, a z-test can be used with sample standard deviations as estimates. The test statistic is

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$.

Example numbers: $n_1 = n_2 = 40$, $\bar{x}_1 = 85$, $\bar{x}_2 = 78$, $S_1 = 12$, $S_2 = 10$. Calculated $z \approx 2.834$, one-sided p-value ≈ 0.0023. Conclusion: at $\alpha = 0.05$ reject H0; there is statistically significant evidence that young professionals spend more on groceries on average.
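The calculation above can be reproduced with the standard library (the one-sided p-value comes from the normal tail via `math.erfc`):

```python
import math

# Two-sample z-test with the numbers from the notes
n1 = n2 = 40
xbar1, xbar2 = 85, 78
s1, s2 = 12, 10

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference in means
z = (xbar1 - xbar2) / se                 # under H0: mu1 - mu2 = 0

# One-sided p-value P(Z > z) from the standard normal distribution
p = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 3), round(p, 4))
```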

📐 Contingency tables and the Chi-Squared test of independence

Use a Chi-Squared test of independence to check association between two categorical variables. For a contingency table with observed counts $O_{ij}$ and expected counts $E_{ij}$ (under independence), compute

$\chi^2 = \sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$.

Degrees of freedom for an $r \times c$ table: $(r-1)(c-1)$.

Example (strawberry spoilage after 5 days): Observed counts: cardboard spoiled 100 / not spoiled 119; plastic spoiled 200 / not spoiled 153. Row totals: 219 and 353; column totals: 300 and 272; grand total 572. Expected counts computed by $(\text{row total} \times \text{column total})/572$. Calculated $\chi^2 \approx 6.5512$ with 1 degree of freedom; p-value ≈ 0.01. Conclusion: reject independence at the 5% level — packaging material is associated with spoilage.
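The expected counts and the statistic can be computed directly from the observed table:

```python
# Chi-squared test of independence for the strawberry-spoilage table
observed = [[100, 119],   # cardboard: spoiled, not spoiled
            [200, 153]]   # plastic:   spoiled, not spoiled

row_totals = [sum(row) for row in observed]        # [219, 353]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 272]
grand = sum(row_totals)                            # 572

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed[i][j] - expected) ** 2 / expected

df = (2 - 1) * (2 - 1)  # (r-1)(c-1) for a 2x2 table
print(round(chi2, 4), df)
```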

📉 Multiple linear regression — interpretation of R output

The fitted model from R gives coefficient estimates, standard errors, t-values and p-values. The regression equation for the expected response is written from the estimates. Example from advertising data:

$E(\text{sales}) = 2.938889 + 0.045765\cdot\text{TV} + 0.188530\cdot\text{radio} - 0.001037\cdot\text{newspaper}$.

Interpretation:

  • Each additional unit of TV budget (1 thousand dollars) is associated with an average increase of about 0.0458 thousand units in sales, holding other predictors constant.
  • Radio has a similarly positive effect (≈ 0.1885 per thousand).
  • Newspaper coefficient is very small and not statistically significant (high p-value) — no evidence of an effect.

Model quality measures:

  • Residual standard error measures typical size of residuals.
  • Multiple R-squared (here 0.8972) indicates proportion of variance explained by the predictors. Adjusted R-squared penalises extra variables.

Model selection: prefer simpler models when they explain as much variance. If adding newspaper does not increase R-squared meaningfully and its coefficient is not significant, prefer the model with TV and radio only.

Prediction and residuals:

  • Predicted sales at TV = 10, radio = 10, newspaper = 10 (units in thousands): plug the values into the equation to get a prediction ≈ 5.27 (thousand units).
  • Residual = observed − predicted. If observed sales = 10 (thousand), residual ≈ 10 − 5.27 = 4.73 (thousand units).
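Plugging in the values is a one-liner once the fitted coefficients from the notes are stored:

```python
# Prediction from the fitted advertising model (coefficients from the notes)
b0, b_tv, b_radio, b_news = 2.938889, 0.045765, 0.188530, -0.001037

def predict(tv, radio, newspaper):
    """Expected sales (thousand units) for budgets in thousand dollars."""
    return b0 + b_tv * tv + b_radio * radio + b_news * newspaper

pred = predict(10, 10, 10)
residual = 10 - pred  # observed sales = 10 (thousand units)
print(round(pred, 2), round(residual, 2))
```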

📌 Practical decision rules and significance

  • For a one-sided z-test at $\alpha = 0.05$, critical value $z_{0.95} \approx 1.645$. Reject H0 if $z > 1.645$.
  • For Chi-Squared with 1 df and $\alpha = 0.05$ the critical value ≈ 3.841. Reject H0 if $\chi^2 > 3.841$.
  • Always report the test statistic, degrees of freedom, p-value and conclusion in context.
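Both critical values quoted above can be recovered from the standard library: the one-sided z cut-off is the 0.95 normal quantile, and the chi-squared(1 df) cut-off is the square of the 0.975 normal quantile (since a chi-squared variable with 1 df is a squared standard normal):

```python
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.95)           # one-sided z-test, alpha = 0.05
chi2_crit = NormalDist().inv_cdf(0.975) ** 2  # chi-squared(1 df) = Z^2

print(round(z_crit, 3), round(chi2_crit, 3))
```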

🧾 Useful formulas (compact)

  • Standard error of the mean: $\text{SE}(\bar{X}) = \dfrac{S}{\sqrt{n}}$.
  • SE of difference between two means: $\text{SE}(\bar{X}_1-\bar{X}_2)=\sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}}$.
  • $100(1-\alpha)$% CI for a large-sample mean: $\bar{X} \pm z_{\alpha/2}\cdot \dfrac{S}{\sqrt{n}}$.
  • Chi-squared statistic: $\chi^2 = \sum_k \dfrac{(O_k - E_k)^2}{E_k}$.
  • Logistic regression (log-odds): $\log\left(\dfrac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.
  • Poisson regression with offset: $\log(E(Y)) = \log(t) + \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.
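As a worked use of the CI formula above, here is a minimal sketch with made-up summary numbers (n, mean, and sd are hypothetical):

```python
import math
from statistics import NormalDist

# 95% CI for a large-sample mean; summary numbers are hypothetical
n, xbar, s = 50, 20.4, 3.2

se = s / math.sqrt(n)            # SE(X-bar) = S / sqrt(n)
z = NormalDist().inv_cdf(0.975)  # z_{alpha/2} for alpha = 0.05
ci = (xbar - z * se, xbar + z * se)
print(tuple(round(v, 2) for v in ci))
```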

✅ Final tips for exams

Work clearly: state hypotheses, show formulae and substitutions, compute the test statistic and p-value or compare to critical values, and give a plain-language conclusion about the real-world question. Always check assumptions (sample size for Normal approximations, expected counts for Chi-Squared, linearity and residual behaviour for regression).
