Lists of Publications

Research Hiatus

I completed my PhD in Information Retrieval, and have done a fair bit of work trying to get into the digital humanities space, some of which was stymied by COVID shutting down most in-person events while I was a junior faculty member. I was trending toward doing work in other applied machine learning venues when I left academia in Summer 2022.

I still have a handful of projects in flight, but they’re more of the systems flavor. For instance, I’m still working with an honors thesis student on applying NLP techniques to C source code, and I’m hoping to circle back to this project soon.

Information Retrieval Research & Poetry Data

To quote my website from my time as a CS faculty member:

The core of my research is in promoting access to information through classification, categorization, and retrieval with a focus on digital library data. As a result, I am always pursuing the balance between efficient, explainable, and effective machine learning systems.

Over the past few years I have worked with various students to automatically extract, curate, and search poetry from public-domain books.

I’m retiring my live search system (it wasn’t getting very much traffic), but would happily get it running again for an interested scholar. If you’re interested in working with the data, drop me an email with “Poetry Corpus” in the subject line.

Otherwise, you can find the datasets from my thesis on CIIR Downloads / Poetry and read more about where this work was going in a poster presented at DH2020.

Code for identifying which pages of DJVUXML-formatted scanned books contain poetry can still be found on GitHub: jjfiv/poetry-identification. It’s in Rust, so it shouldn’t version-rot, but please file an issue (or email me) if you’d like to use it and run into any trouble.


Twitter is kind of on the way out (Nov 2022). I worked with an excellent student (Sivan Nachum) to create a Twitter-bot version of my poetry search tool during Summer 2020, but I’m also retiring this bot, so the API need not stay up and running.