brayton hall



Data Scientist
Natural Language Processing
Philosophy and Literature

About

B.S. Psychology, UNC Chapel Hill
M.A. Philosophy, Virginia Tech

I currently work on NLP problems such as summarization and classification over large volumes of text, fine-tuning GPT, T5, Pegasus, and other LLMs for each task. I enjoy the math and alchemy of fine-tuning.

I'm a data scientist with experience in data acquisition and modeling, statistical analysis, machine learning, and natural language processing. With a background in mathematical logic and philosophy of science at Virginia Tech, I excel at communicating insights from technical, high-dimensional data, whether structured or unstructured, in a way that is interpretable and precise.

Skills

  • Python
  • Large Language Model Alchemist
  • SQL
  • Machine Learning
  • Data Visualization
  • Statistics
  • JavaScript
  • HTML
  • CSS
  • React

Portfolio

These are some of my recent projects, updated February 2023.

Novel Summarization with fine-tuned Transformers

The goal is to use state-of-the-art transformers in novel ways, and quite possibly on novels themselves.

Specifically, the project tests how transformers with many parameters can be fine-tuned on very little labeled data (i.e., few-shot learning), so that a large language model learns to summarize the entirety of War and Peace, or any other long-form document, with output options steered by different embeddings to capture desired themes. The potential of this kind of few-shot learning, combined with active learning, is enormous.


Semantic Search

A 'search-between-lines' app for searching by connotation and misremembered quotes rather than exact fragments.

Created using Doc2Vec, an extension of Word2Vec that learns embeddings for whole paragraphs rather than single words. The model was trained on free ebooks from Project Gutenberg. The trained model is stored on AWS S3, and the app is deployed on Heroku using Docker.

Ulysses Plot and Sentiment Visualization

The aim of this project was to provide some proof of interdisciplinary collaboration between data science and literary criticism. Ulysses is the ideal novel for such a project: its reputation, difficulty, and variety of themes make it well suited to NLP analysis, and the existing plethora of academic research on the novel can be used both for corroboration and as a starting point for representing literary ideas mathematically or otherwise.

The project is also a proof of concept for creative and unexplored applications of NLP. The graphic illustrates how each chapter can be vectorized using TF-IDF, reduced to two dimensions using PCA, and plotted with lines connecting the chapters in chronological order. The size of each bubble corresponds to the word count of its chapter.
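The pipeline described above can be sketched roughly as follows. This version uses only NumPy, with a smoothed IDF and PCA computed via SVD; the function names `tfidf_matrix` and `pca_2d` and the toy "chapters" are hypothetical stand-ins, not the project's code or the text of Ulysses.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (matrix, vocab) with one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {tok: j for j, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = np.zeros((n_docs, len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            matrix[i, column[tok]] = tf * idf
    return matrix, vocab


def pca_2d(matrix):
    """Project rows onto the first two principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:2].T


# Invented placeholder "chapters" for illustration only.
chapters = [
    "morning tower sea shaving bowl morning",
    "school history lesson coins letter",
    "beach thoughts sea sand tide thoughts sea",
    "kitchen breakfast cat letter kidney",
]

matrix, vocab = tfidf_matrix(chapters)
coords = pca_2d(matrix)                             # one (x, y) point per chapter
word_counts = [len(c.split()) for c in chapters]    # bubble sizes
```

From here, `coords` could be drawn as a scatter plot with marker sizes scaled by `word_counts` and a line through the points in chapter order.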

Covid-19 Twitter Topic Modeling

Used unsupervised learning on 100,000 tweets to find the most intelligibly distinct Twitter communities (e.g., 'Economy', 'Trump', 'Healthcare') using LDA (latent Dirichlet allocation), and performed sentiment analysis on those clusters.

Get in touch

Don't hesitate to reach out if you have any questions or would like to chat.