Technical Appendix

Last updated: May 6, 2025

This appendix outlines the technical methodology and implementation details behind each visualization and model used in the blog “Before (and After) Roe v. Wade: A Content Analysis of Conservative and Liberal Media on Abortion Legislation in the United States, 2020–2024.”

The visualizations and models presented in the main blog aim to capture shifting narratives around abortion in the media landscape between 2020 and 2024. However, designing these visual outputs required several additional data wrangling steps and modeling decisions not immediately evident in the main text. This appendix documents those internal decisions to ensure reproducibility.

Sentiment Calendar

The sentiment calendar was created to visualize fluctuations in the emotional tone of abortion-related articles across time in a way that captures both daily granularity and broader temporal rhythms. However, several practical challenges emerged. First, articles were published irregularly, with clusters during major events and long stretches without coverage. Second, not every day yielded enough data for reliable sentiment calculation. These gaps, if left untreated, would create abrupt holes in the calendar heatmap and obscure patterns of intensity and tone. Therefore, we introduced a structured calendar sequence spanning from January 1, 2020 to December 31, 2024 and implemented a light-touch imputation process strictly for visualization purposes. For days with no articles, sentiment scores were missing (i.e., NA). To preserve visual continuity in the calendar heatmap without injecting false sentiment, we applied linear interpolation using the zoo::na.approx() function.
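As a rough sketch of this step, assuming a data frame daily_sentiment with one row per covered day and columns date and avg_sentiment (the object and column names here are illustrative, not necessarily those in our scripts):

```r
library(dplyr)
library(zoo)

# Structured calendar covering every day in the study window
calendar <- tibble(date = seq(as.Date("2020-01-01"), as.Date("2024-12-31"), by = "day"))

calendar_sentiment <- calendar %>%
  left_join(daily_sentiment, by = "date") %>%   # days with no articles become NA
  arrange(date) %>%
  # Linear interpolation across interior gaps; because the calendar is complete and
  # daily, row position stands in for time. na.rm = FALSE keeps leading/trailing NAs.
  mutate(avg_sentiment = zoo::na.approx(avg_sentiment, na.rm = FALSE))
```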

Mathematically, linear interpolation estimates the missing sentiment value \(s_t\) at time \(t\) between two known values \(s_{t_1}\) and \(s_{t_2}\) as:

\[ s_t = s_{t_1} + \frac{t - t_1}{t_2 - t_1} (s_{t_2} - s_{t_1}) \]

This ensures that gaps are filled based on the slope between the two nearest known points. For example, if the sentiment on July 1 is \(-2\) and on July 3 is \(4\), then:

\[ s_{\text{July 2}} = -2 + \frac{1}{2}(4 - (-2)) = 1 \]
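The same value falls out of zoo::na.approx() applied to this toy series:

```r
zoo::na.approx(c(-2, NA, 4))
#> [1] -2  1  4
```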

This method is widely used for visual smoothing in time series displays when gaps are not analytically meaningful (Shumway and Stoffer 2017; Zeileis and Grothendieck 2005). Interpolated values were used only to preserve visual continuity in the geom_tile_interactive() rendering; they were never fed into any modeling step.

Remaining missing values at the edges (i.e., consecutive NAs at the start or end) were conservatively replaced with 0, which we interpret as a neutral baseline rather than evidence of balanced tone.
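Continuing the sketch above, this boundary fill amounts to one more step:

```r
calendar_sentiment <- calendar_sentiment %>%
  mutate(avg_sentiment = coalesce(avg_sentiment, 0))  # leading/trailing NAs -> neutral 0
```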

As a point of reference, the calendar from January 1, 2020 to December 31, 2024 spans 1,827 days. Our dataset includes over 3,100 articles, and because many days have multiple articles, the temporal coverage is dense. We are therefore confident that linear interpolation did not substantially distort the underlying sentiment trends.

We also acknowledge that linear interpolation is a relatively simple method and may not fully account for structural gaps in temporal coverage or non-linear shifts in sentiment dynamics. Ideally, future work would explore more principled approaches from missing data theory, such as time-aware imputation with uncertainty quantification. Hopefully, we can revisit this aspect of our project after taking STAT404: Data Not Found: An Introduction to Missing Data Methodology, taught by our wonderful STAT231: Data Science instructor herself, Prof. Katharine Correia, when it is next offered!

Topic Modeling and Network Visualizations

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used to detect latent thematic structures in large text corpora. The term latent refers to the fact that the topics inferred by LDA are not directly observable, but are hidden variables inferred from the co-occurrence patterns in the observed data (i.e., words). LDA assumes that each document \(d\) is associated with a topic distribution \(\theta_d \sim \text{Dirichlet}(\alpha)\), and each topic \(k\) is associated with a word distribution \(\phi_k \sim \text{Dirichlet}(\beta)\). For every word \(w_{d,n}\) in document \(d\), a topic \(z_{d,n}\) is first drawn from \(\theta_d\), and a word is then sampled from the corresponding topic’s distribution:

\[ w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}}) \]

In this formulation, \(\alpha\) and \(\beta\) are hyperparameters for the Dirichlet priors: \(\alpha\) controls the sparsity of topics within a document, and \(\beta\) controls the sparsity of words within a topic.

The Dirichlet distribution is a multivariate generalization of the Beta distribution. It generates a probability vector of non-negative real numbers that sum to one; in other words, a distribution over categories. For instance, if a document is composed of three topics, a sample from a Dirichlet distribution might be \((0.7, 0.2, 0.1)\), indicating the document is 70% about Topic 1, 20% about Topic 2, and 10% about Topic 3. Smaller values of \(\alpha\) and \(\beta\) (e.g., \(< 1\)) yield sparse distributions, where documents concentrate on a few topics and topics on a few words. Larger values (e.g., \(> 1\)) yield more uniform distributions. These priors allow LDA to reflect the assumption that documents are usually “about a few things.”
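This sparsity behavior is easy to check with a small simulation, using the standard construction of a Dirichlet draw as normalized independent Gamma variables (the helper name below is ours):

```r
set.seed(231)

# One draw from Dirichlet(alpha): normalize independent Gamma(alpha_i, 1) variables
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

round(rdirichlet_one(rep(0.1, 3)), 2)  # small alpha: most mass on one or two topics
round(rdirichlet_one(rep(10, 3)), 2)   # large alpha: roughly even across topics
```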

We experimented with values of \(k\) ranging from 4 to 12, but due to computational limitations and the exploratory nature of our analysis, we ultimately selected \(k = 5\). This choice also aligns with other media framing studies that apply LDA in political discourse contexts (e.g., Roberts et al. 2014) and allows us to balance interpretability with thematic resolution.
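The fitting call itself is short. A sketch, assuming a document-term matrix dtm built upstream and the topicmodels implementation of LDA (for which tidytext provides a tidy() method used below):

```r
library(topicmodels)

# Fit the 5-topic model; fixing the seed makes the stochastic fit reproducible.
# We repeated this call for k = 4 through 12 before settling on k = 5.
abortion_lda <- LDA(dtm, k = 5, control = list(seed = 1234))
```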

After fitting, we extracted topic–term probabilities, denoted \(\beta_{k,w} = P(w \mid z = k)\), using tidytext::tidy(). These represent the probability of word \(w\) appearing in topic \(k\) and should not be confused with the Dirichlet prior \(\beta\) used in model initialization. We retained terms with \(\beta_{k,w} > 0.015\) to focus on high-association terms and manually labeled each topic based on its top-ranked terms. For example, a topic dominated by “heartbeat,” “ban,” and “unborn” was labeled Pro-Life Messaging.
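Concretely, the extraction and filtering step looks like the following, continuing with the hypothetical abortion_lda object:

```r
library(tidytext)
library(dplyr)

# Per-topic word probabilities: columns topic, term, beta
topic_terms <- tidy(abortion_lda, matrix = "beta") %>%
  filter(beta > 0.015) %>%          # keep only high-association terms
  arrange(topic, desc(beta))

# Inspect the top-ranked terms per topic (e.g., top 10) before hand-labeling
topic_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()
```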

To further examine framing differences, we visualized topic structures as bipartite networks using igraph and visNetwork. Nodes represented topics and bigrams, and edges were weighted by \(\beta_{k,w}\). Topic nodes were styled by outlet and time period, using box shapes and consistent color palettes, and interactivity allowed users to trace how themes emerged and diverged across time. Shared bigrams appeared across multiple topic nodes, highlighting recurring rhetorical devices and contested framings that spanned ideological divides.
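A skeletal version of the network construction is below; it assumes topic_terms from the previous sketch with an added topic_label column, and it omits the outlet- and period-specific styling described above:

```r
library(dplyr)
library(visNetwork)

# Bipartite node set: one node per topic label (boxes) and one per retained bigram
nodes <- bind_rows(
  topic_terms %>% transmute(id = topic_label) %>% distinct() %>%
    mutate(group = "topic", shape = "box"),
  topic_terms %>% transmute(id = term) %>% distinct() %>%
    mutate(group = "bigram", shape = "ellipse")
) %>%
  mutate(label = id)

# Edges run topic -> bigram, with width proportional to the topic-term probability
edges <- topic_terms %>%
  transmute(from = topic_label, to = term, value = beta)

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE)
```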

While LDA is foundational in topic modeling, it is limited by the bag-of-words assumption and fixed topic structures. We look forward to revisiting these questions with more flexible methods, such as dynamic topic models and contextual embeddings, once STAT325: Text Analytics is next offered by Prof. Nicholas Horton!

References

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., and Rand, D. G. (2014), “Structural topic models for open-ended survey responses,” American Journal of Political Science, 58, 1064–1082. https://doi.org/10.1111/ajps.12103.
Shumway, R. H., and Stoffer, D. S. (2017), Time series analysis and its applications: With R examples, Springer.
Zeileis, A., and Grothendieck, G. (2005), “zoo: S3 infrastructure for regular and irregular time series,” Journal of Statistical Software, 14, 1–27.