braytonhall

These are some of my recent projects, updated February 2023.

Novel Summarization with fine-tuned Transformers

The goal is to use state-of-the-art transformers in novel ways, and quite possibly on novels themselves.

Specifically, the project tests how transformers with many parameters can be fine-tuned with very little labeled data (i.e. few-shot learning), so that a large language model can learn to approximate the task of summarizing the entirety of War and Peace (or any long-form document), with output options along different embeddings to capture desired themes. The potential of this type of few-shot learning, in conjunction with active learning, is enormous.

Zero-shot, chunk-by-chunk summarization of the entirety of War and Peace.

The example above represents the first 512 tokens of War and Peace. The following summary corresponds to approximately the first 5,120 words of War and Peace, or about 1/100 of the novel. This span could be adjusted, and the model fine-tuned on certain 'themes' or embeddings produced from a bit of manual supervision and active learning.

See Code
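
A minimal sketch of the chunking idea, not the project's actual code: the text is split into fixed-size pieces, each piece is summarized independently, and the partial summaries are joined. The names `chunk_words`, `summarize_in_chunks`, and `first_sentence` are hypothetical, the chunk size is counted in words as a crude stand-in for a model's token budget, and `first_sentence` is only a placeholder for a real transformer summarizer.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def summarize_in_chunks(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 400,
) -> str:
    """Summarize each chunk on its own, then join the partial summaries."""
    partial = [summarize(chunk) for chunk in chunk_words(text, chunk_size)]
    return " ".join(partial)


def first_sentence(chunk: str) -> str:
    """Placeholder 'model' (assumption): keep text up to the first period."""
    head, sep, _ = chunk.partition(".")
    return head + sep
```

In practice the `summarize` argument would wrap a pretrained summarization model, and the joined output could itself be passed through `summarize_in_chunks` again to compress it further.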

Semantic Search

A 'search-between-lines' app, to search by connotations and misremembered quotes, rather than exact fragments.

Created using Doc2Vec, an extension of Word2Vec that learns embeddings for whole paragraphs or documents rather than single words. The model was trained on free ebooks from Project Gutenberg. The model is hosted on AWS S3, and Docker was used to deploy the app on Heroku.

See Code

Ulysses Plot and Sentiment Visualization

The aim of this project was to provide some proof of interdisciplinary collaboration between data science and literary criticism. Ulysses is the ideal novel for such a project: its reputation, difficulty, and variety of themes make it well suited to NLP analysis, in conjunction with an already existing plethora of academic research on the novel, which can be used both for corroboration and as a starting point for representing literary ideas mathematically or otherwise.

The project is also a proof of concept for creative and unexplored applications of NLP. The graphic illustrates how each chapter can be vectorized using TF-IDF, reduced to two dimensions using PCA, and plotted with lines connecting the chapters in chronological order. The sizes of the bubbles correspond to the word counts of each chapter.
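
The pipeline described above can be sketched as follows, assuming scikit-learn; the source does not name the library used. The function name `chapter_layout` and the English stop-word filter are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def chapter_layout(chapters):
    """Return 2-D coordinates and word counts for a list of chapter texts."""
    # One TF-IDF row per chapter, one column per vocabulary term.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(chapters)
    # PCA needs a dense array; project every chapter onto two components.
    coords = PCA(n_components=2).fit_transform(tfidf.toarray())
    # Word counts can drive the bubble sizes in the plot.
    word_counts = [len(chapter.split()) for chapter in chapters]
    return coords, word_counts
```

The returned coordinates could then be drawn with any plotting library, connecting consecutive rows with a line to show chapter order and scaling each marker by its word count.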

Covid-19 Twitter Topic Modeling

Used unsupervised learning on 100,000 tweets to find the most intelligibly distinct Twitter communities (e.g. 'Economy', 'Trump', 'Healthcare') using LDA (latent Dirichlet allocation), and performed sentiment analysis on those clusters.

See Code
