Methodology
Data Collection
This study examines abortion-related media coverage from Fox News and The New York Times between January 2020 and December 2024. These two outlets were selected for their national visibility and ideological contrast, which make them well suited to a comparative analysis of coverage of a politically contentious issue.
To gather the articles, we developed two independent scraping pipelines. For The New York Times, we queried the internal search portal using the term “abortion,” restricted to the specified date range and English-language results. The search was sorted chronologically to allow the collection of complete temporal data. For Fox News, we accessed the abortion category under the Politics > Judiciary section, where articles tagged by the editorial team were archived. To collect them, we built custom Python scripts that acted like virtual readers: they opened each site through a simulated browser, clicked through dynamic pages, and followed internal links just as a human user would. Behind the scenes, the scripts routed all traffic through the Tor network to anonymize activity and avoid being blocked. Using selenium-driverless, we intercepted each site’s internal API calls and extracted article content in full.
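As a simplified illustration of this crawl pattern, the sketch below uses the standard selenium package with a Tor SOCKS proxy rather than selenium-driverless and API interception; the URL, CSS selector, and proxy port are placeholders rather than the values used in our pipeline.

```python
# Simplified sketch of the crawl pattern described above. It uses standard
# selenium with a Tor SOCKS proxy instead of selenium-driverless and API
# interception; the URL, selector, and proxy port are illustrative assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By

TOR_PROXY = "socks5://127.0.0.1:9050"  # default Tor SOCKS port (assumes a local Tor daemon is running)

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server={TOR_PROXY}")  # route browser traffic through Tor
options.add_argument("--headless=new")               # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical search/category URL; the real pipeline paged through results dynamically.
    driver.get("https://www.example.com/search?q=abortion")

    # Collect article links from the results page, then visit each one like a reader would.
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a.article-link")]

    articles = []
    for url in links:
        driver.get(url)
        body = " ".join(p.text for p in driver.find_elements(By.TAG_NAME, "p"))
        articles.append({"url": url, "text": body})
finally:
    driver.quit()
```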
The dataset included articles categorized as news reports, editorials, and features, provided they contained substantive textual content. Articles that were duplicates, excessively brief, or composed primarily of video embeds were excluded to preserve analytical quality. We also incorporated state-level abortion policy data from the Guttmacher Institute, which provides information on bans, gestational limits, and trigger laws.
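A minimal sketch of these exclusion rules, assuming a pandas data frame produced by the scraping step; the file name, column names, and word-count threshold are illustrative, since the criteria above do not fix an exact cutoff.

```python
import pandas as pd

# Hypothetical file produced by the scraping step; column names are assumptions.
df = pd.read_csv("articles_raw.csv")  # columns: url, title, date, source, text

MIN_WORDS = 150  # hypothetical threshold; the text above only says "excessively brief"

df = (
    df.drop_duplicates(subset="url")  # remove duplicate articles
      .loc[lambda d: d["text"].fillna("").str.split().str.len() >= MIN_WORDS]  # drop very brief or video-only pieces
)
df.to_csv("articles_clean.csv", index=False)
```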
All content was scraped from publicly available webpages and accessed in accordance with each outlet’s robots.txt file. Our scripts mimicked respectful user behavior and made no attempt to access paywalled or private content. No user data or personal identifiers were collected at any point. All scraping and analysis were conducted locally, and results were used exclusively for non-commercial purposes.
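One minimal way to verify this kind of robots.txt compliance programmatically is Python's standard-library robotparser; the domain and user agent below are placeholders, not the outlets' actual values.

```python
from urllib import robotparser

# Illustrative check; the domain and the wildcard user agent are placeholders.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/politics/some-article"
if rp.can_fetch("*", url):
    print("allowed to fetch:", url)
else:
    print("disallowed by robots.txt:", url)
```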
Variables and Measures
Each article served as an individual unit of analysis. For each piece, we recorded the article title, publication date, and source outlet. While we originally intended to classify articles by type (e.g., news, feature, or opinion), structural inconsistencies across both websites limited our ability to do so during scraping; we plan to revisit this classification in future iterations. The full article body was extracted and processed using nltk. We then created a binary indicator to distinguish whether an article was published before or after the June 24, 2022 Dobbs decision. Cleaned text was used to compute word counts and to generate unigrams and bigrams by frequency and TF-IDF weighting.
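The sketch below illustrates this preprocessing, using nltk for tokenization and n-grams and pandas for the Dobbs indicator. The TF-IDF step uses scikit-learn's TfidfVectorizer as an assumed stand-in (the write-up only states that TF-IDF weights were computed), and the file and column names are placeholders.

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")
nltk.download("stopwords")

df = pd.read_csv("articles_clean.csv")  # hypothetical file from the filtering step
df["date"] = pd.to_datetime(df["date"])
df["post_dobbs"] = df["date"] >= pd.Timestamp("2022-06-24")  # binary pre/post-Dobbs indicator

stop = set(stopwords.words("english"))

def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop]

df["tokens"] = df["text"].map(tokenize)
df["word_count"] = df["tokens"].str.len()
df["bigrams"] = df["tokens"].map(lambda ts: list(nltk.bigrams(ts)))

# TF-IDF over the cleaned text (scikit-learn is an assumed tool here).
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
tfidf_matrix = tfidf.fit_transform(df["tokens"].str.join(" "))
```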
The outcome of interest was article-level sentiment, calculated using the AFINN lexicon. Each word in AFINN is assigned a sentiment score between –5 (strongly negative) and +5 (strongly positive). Because each article's score is the mean of its matched terms, the aggregated scores remain within this same range; in practice their magnitude reflects the balance of positive and negative sentiment-bearing language in the article.
Data Analysis
After wrangling, we performed lexicon-based sentiment analysis, topic modeling, co-occurrence network construction, and legal-geographic mapping.
Sentiment was measured by averaging AFINN scores across all matched words per article. Though this approach does not capture context or rhetorical nuance, it provides a method for comparing tone across time and outlet.
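A minimal sketch of this per-article averaging, assuming the AFINN-111 word list is available locally as a tab-separated file of word-score pairs; the file name and example tokens are placeholders.

```python
# Minimal sketch of per-article AFINN averaging. It assumes the AFINN-111 word
# list has been downloaded as a tab-separated file of "word<TAB>score" pairs;
# the file name and the example tokens are placeholders.
from statistics import mean

def load_afinn(path: str = "AFINN-111.txt") -> dict[str, int]:
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, score = line.rstrip("\n").split("\t")
            lexicon[word] = int(score)
    return lexicon

def article_sentiment(tokens: list[str], lexicon: dict[str, int]) -> float | None:
    """Mean AFINN score over matched words; None if no sentiment words match."""
    matched = [lexicon[t] for t in tokens if t in lexicon]
    return mean(matched) if matched else None

afinn = load_afinn()
print(article_sentiment(["hope", "ban", "protest", "support"], afinn))  # illustrative tokens
```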
We ran Latent Dirichlet Allocation (LDA) topic models separately by outlet and by pre-/post-Dobbs period to identify recurring thematic structures. The resulting topic-term and document-topic distributions were interpreted to assess how media framing shifted over time. LDA modeling was conducted using the topicmodels package in R and gensim in Python.
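A minimal gensim sketch of one such model, fitted on a single outlet subset; the column names, subset label, and number of topics are illustrative choices rather than the settings used in the project.

```python
# Minimal gensim LDA sketch. df is the data frame from the preprocessing sketch
# above; the "nyt" label and k = 6 topics are illustrative assumptions.
from gensim import corpora
from gensim.models import LdaModel

docs = df.loc[df["source"] == "nyt", "tokens"].tolist()  # one outlet subset

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare and very common terms
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=6,
               passes=10, random_state=42)

# Inspect the top terms per topic.
for topic_id, terms in lda.show_topics(num_topics=6, num_words=10, formatted=False):
    print(topic_id, [w for w, _ in terms])
```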
To visualize linguistic associations, we built word co-occurrence networks from bigram frequencies. Nodes represented terms, while edge weight reflected co-occurrence strength. The networks, implemented with igraph and ggraph, were rendered using force-directed layouts to highlight topic clusters and central vocabulary.
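Our networks were built with igraph and ggraph in R; as a rough Python analogue, the sketch below assembles a bigram co-occurrence graph with networkx and a spring (force-directed) layout, with the frequency cutoff chosen purely for readability.

```python
# Rough Python analogue of the bigram co-occurrence network; the project itself
# used igraph and ggraph in R. networkx and matplotlib are assumed stand-ins.
from collections import Counter
import networkx as nx
import matplotlib.pyplot as plt

# Count bigrams across all articles (df["bigrams"] from the preprocessing sketch).
bigram_counts = Counter(bg for bigrams in df["bigrams"] for bg in bigrams)

G = nx.Graph()
for (w1, w2), n in bigram_counts.items():
    if n >= 20:  # illustrative cutoff to keep the graph readable
        G.add_edge(w1, w2, weight=n)

pos = nx.spring_layout(G, seed=42)  # force-directed layout
nx.draw_networkx_nodes(G, pos, node_size=40)
nx.draw_networkx_edges(G, pos, width=[G[u][v]["weight"] / 50 for u, v in G.edges()])
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis("off")
plt.show()
```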
Finally, we used the Guttmacher dataset to draft a U.S. abortion ban map, illustrating which states implemented total or near-total abortion restrictions post-Dobbs. This map was integrated with the media timeline to support interpretive framing.
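Our map was produced with the sf and leaflet R packages; the sketch below is a rough Python analogue using a plotly choropleth keyed on two-letter state codes, and the file, column names, and category labels are assumptions about how the Guttmacher policy data might be coded.

```python
# Rough Python analogue of the state policy map; the project itself used the
# sf and leaflet R packages. The file, column names, and category labels are
# assumptions about how the Guttmacher policy data might be coded.
import pandas as pd
import plotly.express as px

policy = pd.read_csv("guttmacher_policies.csv")  # hypothetical export: state_abbrev, ban_status

fig = px.choropleth(
    policy,
    locations="state_abbrev",   # two-letter state codes
    locationmode="USA-states",
    color="ban_status",         # e.g., "total ban", "gestational limit", "protected"
    scope="usa",
    title="State abortion policies after Dobbs (illustrative sketch)",
)
fig.show()
```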
All analyses were conducted using Python (for scraping) and R (for wrangling, modeling, visualization, and reporting).
Limitations
Lexicon-based sentiment scoring is limited in scope and does not account for the syntactic or rhetorical context in which words appear. As such, emotion may be misrepresented in articles with subtle or indirect tone. Differences in article length, editorial style, and formatting between The New York Times and Fox News may also affect comparisons of sentiment scores.
As discussed in the main content of the blog, we observed that some articles, particularly pre-Dobbs features from The New York Times, appeared to carry a broadly positive tone based on sentiment scoring. Closer reading revealed that this positivity often stemmed from emotionally charged language (e.g., words like “hope,” “support,” or “freedom”) used in quotes from abortion rights advocates. Because lexicon-based scoring such as AFINN ignores the context in which words appear, articles that include supportive quotes or passionate language may be mischaracterized in tone even when their broader framing is neutral or critical.
While our dataset is limited to just two outlets and is not intended to represent the full diversity of the U.S. media landscape, we view this work as a step toward more expansive analyses. We hope it can inform and inspire future projects that broaden the scope of media coverage.
Future Work
Given the scope and limitations of this project, several potential improvements come to mind. Future projects could include additional news sources, such as regional papers, cable news transcripts, and digital-native platforms, to increase breadth and representativeness. We are also interested in experimenting with alternative sentiment models and topic modeling techniques that better capture rhetorical nuance and evolving language use.
Libraries and Tools
This project used a wide range of R packages including tidytext (Silge and Robinson 2023), dplyr (Wickham et al. 2023a), ggplot2 (Wickham 2023a), stringr (Wickham 2023b), readr (Wickham et al. 2023b), tidyr (Wickham and Henry 2023), forcats (Wickham 2023c), and broom (Robinson and Hayes 2023). For network analysis, we relied on igraph (Csárdi and Nepusz 2006), ggraph (Pedersen 2023), echarts4r (Coene 2023), and visNetwork (Thieurmel and Nguyen 2023). Geospatial visualizations were created using leaflet (Cheng 2023) and sf (Pebesma 2023), and for topic modeling we used textmineR (Jones 2023) and topicmodels (Blei et al. 2003).
We also thank the Computer Science community, especially Thu Hoang ’28 and Ryan Ji ’26, for generously assisting with the web scraping components of the project.