<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Miscellaneous on Jack Gindi</title><link>https://www.jgindi.me/tags/miscellaneous/</link><description>Recent content in Miscellaneous on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Fri, 13 Dec 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/tags/miscellaneous/index.xml" rel="self" type="application/rss+xml"/><item><title>Building ML Paper Explorer (late 2024)</title><link>https://www.jgindi.me/posts/2024-12-09-paper-rec-experience/</link><pubDate>Fri, 13 Dec 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-12-09-paper-rec-experience/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Today, I want to talk about a recent personal project. In doing the project, I&amp;rsquo;d say I was probably 60% motivated by learning about new
techniques and tools that are out there, and 40% by seeing what it would be like to put together an end-to-end application with the help
of AI assistants, both within my coding environment (GitHub Copilot) and without (Anthropic&amp;rsquo;s Claude with a smattering of OpenAI&amp;rsquo;s ChatGPT). The
goal was not to build the most optimized, low-latency, state-of-the-art tool; it was to see if I could get all the pieces working together to do
something interesting.&lt;/p>
&lt;p>I&amp;rsquo;ve called what I&amp;rsquo;ve built ML Paper Explorer. It is a simple interface that allows the user to search and save academic papers on machine
learning. The project required me to complete tasks in four broad categories: frontend, backend, machine learning (ML), and deployment. Below, I&amp;rsquo;ll
talk about some of the features I built and what the experience of leaning heavily on AI was like, then close with some broader reflections on being a software engineer in this new age.&lt;/p>
&lt;p>If you want to check the project out, go to &lt;a href="https://www.ml-paper-explorer.com">ml-paper-explorer.com&lt;/a>.&lt;/p>
&lt;h1 id="features">Features&lt;/h1>
&lt;p>I started with a few features I thought were interesting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Paper relevance engine&lt;/strong>: The key backend component was a machine learning engine that could search for papers relevant to a user&amp;rsquo;s query. I implemented a two-pass ranking system. The first pass uses a fast, keyword-based (non-ML) algorithm called &lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25&lt;/a> to filter tens or hundreds of thousands of papers down to ~1000. To further filter the smaller collection down to what I show the user (on the order of 10), I use a text embedding model to generate a numerical, semantically rich representation of the user query and find the ~10 papers whose representations (which we&amp;rsquo;ve pre-stored in a vector database) are the &amp;ldquo;closest&amp;rdquo; to it.&lt;/li>
&lt;li>&lt;strong>Personalization&lt;/strong>: Users can log in (no password required) to like papers and get recommendations based on what they&amp;rsquo;ve liked.&lt;/li>
&lt;li>&lt;strong>Explanations&lt;/strong>: When a paper is returned in response to a query, the user can see an explanation of why it might have matched. This is simply implemented using some prompt engineering on top of the query and the title/abstract of the paper in question.&lt;/li>
&lt;/ul>
&lt;p>(As a general note, when I talk about papers here, I&amp;rsquo;m just using titles and abstracts so as to keep storage and processing times reasonable.)&lt;/p>
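&lt;p>The two-pass idea can be sketched in a few lines of Python. This is purely a toy illustration of the shape of the pipeline, not the actual implementation: the keyword-overlap score stands in for a real BM25 library, and the bag-of-words &amp;ldquo;embedding&amp;rdquo; stands in for a real text embedding model backed by a vector database.&lt;/p>

```python
import math

# Toy corpus of (title, abstract) pairs standing in for the real paper index.
PAPERS = [
    ("Attention Is All You Need", "transformer architecture for sequence modeling"),
    ("Deep Residual Learning", "residual networks for image recognition"),
    ("BERT", "bidirectional transformers for language understanding"),
]

def keyword_score(query, doc):
    """Pass 1 stand-in: a crude keyword-overlap score in place of BM25."""
    terms = set(query.lower().split())
    words = doc.lower().split()
    return sum(words.count(t) for t in terms)

def embed(text):
    """Pass 2 stand-in: a bag-of-words vector in place of a learned embedding."""
    vec = {}
    for term in text.lower().split():
        vec[term] = vec.get(term, 0) + 1
    return vec

def cosine(u, v):
    """Similarity between two sparse vectors, as a vector database would compute."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, first_pass_k=2, final_k=1):
    # Pass 1: cheap keyword filter over the whole corpus
    # (in the real system: BM25, cutting ~100k papers down to ~1000).
    candidates = sorted(
        PAPERS,
        key=lambda p: keyword_score(query, p[0] + " " + p[1]),
        reverse=True,
    )[:first_pass_k]
    # Pass 2: semantic re-ranking of the survivors by embedding similarity.
    q_vec = embed(query)
    ranked = sorted(candidates, key=lambda p: cosine(q_vec, embed(p[1])), reverse=True)
    return [title for title, abstract in ranked[:final_k]]

print(search("transformer language understanding"))  # ['BERT']
```

&lt;p>The point of the two-pass structure is that every paper in the index pays only the cheap keyword score, and only the small set of survivors pays for the more expensive semantic comparison.&lt;/p>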
&lt;h1 id="implementing-the-web-app">Implementing the web app&lt;/h1>
&lt;p>Given my background, this part was quite unfamiliar to me, and to do it, I leaned &lt;em>heavily&lt;/em> on Anthropic&amp;rsquo;s Claude.&lt;/p>
&lt;p>Claude essentially wrote the first iteration of the frontend completely from scratch. Then, when I decided I wanted to separate the initial one-page design into multiple pages, it deftly refactored the code to cleanly handle the updated organization. It even added a very aesthetic landing page without my explicitly asking for it! There were other small, probably underspecified changes I wanted to make, such as displaying a user&amp;rsquo;s login status more cleanly, adding subtle animation when search results appear, changing the formats of the cards containing paper details, or adding a navigation bar, and it was able to correctly and efficiently make those changes as well. Implementing the backend felt a bit more comfortable and familiar, but Claude was very helpful in setting up some boilerplate code that I found straightforward to extend. As I&amp;rsquo;m sure other programmers have experienced, Claude was also very useful as a debugging partner.&lt;/p>
&lt;p>Deployment with modern cloud-based tools is also not an area in which I&amp;rsquo;m terribly knowledgeable or adept. In order to be fully operational online, I had to deploy:&lt;/p>
&lt;ul>
&lt;li>The backend&lt;/li>
&lt;li>The frontend&lt;/li>
&lt;li>A database to hold information about paper metadata and users&amp;rsquo; liked papers&lt;/li>
&lt;li>A vector database for paper similarity search&lt;/li>
&lt;/ul>
&lt;p>Enter Claude once again! It identified available services I could use to host the various pieces (though I ended up hosting the frontend with AWS Amplify rather than Vercel) and then helped me stumble my way through getting all of the pieces to connect and talk to one another. It helped me navigate some of Amazon&amp;rsquo;s web interfaces by looking at screenshots, and also interpreted and explained various error messages that would have taken me longer to resolve on my own.&lt;/p>
&lt;p>I should stress here that without Claude&amp;rsquo;s help, even as a full-time software engineer, this same project that took me a few short weeks to get off the ground would have likely taken me several months, if not much longer.&lt;/p>
&lt;h1 id="reflections">Reflections&lt;/h1>
&lt;p>While it was exciting to get something up and running so quickly, doing this project sparked a few thoughts about what the emergence of tools this powerful might mean for me as a software engineer going forward.&lt;/p>
&lt;p>First, it seems less important than ever for information to live in my head. Not so long ago, to build this simple application I&amp;rsquo;ve described, I would have had to have reasonable command of web development, DevOps (for deployment), and machine learning fundamentals. I was able to get by with little-to-no up-to-date knowledge of &lt;em>two out of three&lt;/em> of those. Would I have been able to do this same project without any programming knowledge whatsoever? I&amp;rsquo;m not sure we&amp;rsquo;re there yet. But I &lt;em>was&lt;/em> surprised by how little I needed to get started.&lt;/p>
&lt;p>With generative AI looking like it will intermediate more and more parts of our work lives, it seems much more important to be able to articulate what you&amp;rsquo;re looking for than to have a lot of a priori knowledge. Another way of saying this is that our ability to accomplish nontrivial things seems to be decorrelating from the amount of time we&amp;rsquo;ve spent learning about them. Given how much time I&amp;rsquo;ve spent studying computer science, math, and machine learning over the last decade, I find this unsatisfying! While I still believe that, at this stage, deep understanding and investigation help me produce my best work, will that remain the case if these models keep improving at this pace?&lt;/p>
&lt;p>The second thing that occurred to me is the way that a certain attribute of generative AI tools that disappoints some people makes it excellent as a coding partner. When people prompt AIs for things like essays and poems, their complaint is often that it&amp;rsquo;s too&amp;hellip; well&amp;hellip; average. &amp;ldquo;That essay would get a B+,&amp;rdquo; they say, &amp;ldquo;but I certainly wouldn&amp;rsquo;t give it an A.&amp;rdquo; I think some disappointment about the quality of AI writing can be boiled down to the fact that it feels sterile and derivative. As a programmer, though, this &amp;ldquo;average&amp;rdquo; quality is actually exactly what you want! When asking an AI for help with programming, what you&amp;rsquo;re looking for is often the consensus opinion about the best way to solve this or that problem. The sort of averaging or convergence that occurs when you compress the entire internet into the parameters of a language model ends up making models like Claude and GPT very helpful programming partners, and frustratingly boring writers. (I think it&amp;rsquo;s certainly possible to get AI to write things that are interesting, but it usually takes effort and clever prompting.)&lt;/p>
&lt;p>While I am admittedly somewhat uncomfortable about the ways that software engineering is going to change &amp;ndash; on a shorter timeline than I thought it might &amp;ndash; I do believe that humanity will ultimately figure out how to leverage these AI technologies to create a better world. In the near-to-medium term, we will have to be extraordinarily careful about ethically and safely applying them (or not) to sensitive areas like education, the military, biomolecular design, or our financial system, but if we can navigate those challenges successfully, I sincerely believe there is tremendous potential.&lt;/p>
&lt;p>It&amp;rsquo;s very possible, even likely, that in that new and hopefully improved world, you&amp;rsquo;ll find me with an old computer, disconnected from the internet, coding, unassisted, like we did in the before times.&lt;/p></description></item><item><title>My first academic publication!</title><link>https://www.jgindi.me/posts/2021-03-10-first-pub/</link><pubDate>Thu, 01 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-03-10-first-pub/</guid><description>&lt;p>I wanted to share that I&amp;rsquo;ve been fortunate enough to publish my first bit of
academic research that I collaborated on with &lt;a href="https://www.nicholasmoehle.com/">Nick Moehle&lt;/a>,
Prof. &lt;a href="https://web.stanford.edu/~boyd/">Stephen Boyd&lt;/a>,
and Prof. &lt;a href="https://mykel.kochenderfer.com/">Mykel Kochenderfer&lt;/a>! When people ask me
what I wish I could have done differently in college, I often lament that I didn&amp;rsquo;t
take the opportunity to get involved with more research. I&amp;rsquo;m really grateful
that I&amp;rsquo;ve had the opportunity to participate in writing this paper, and hope to
collaborate on a few more down the road!&lt;/p>
&lt;p>You can find the paper &lt;a href="https://arxiv.org/abs/2103.05455">here&lt;/a>, and
an open-sourced implementation of some of the algorithms we talk about
&lt;a href="https://github.com/blackrock/lcso">here&lt;/a>. Happy to discuss if you&amp;rsquo;re interested!&lt;/p></description></item><item><title>Anniversary math</title><link>https://www.jgindi.me/posts/2020-11-27-anniversary-math/</link><pubDate>Fri, 27 Nov 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-11-27-anniversary-math/</guid><description>&lt;p>My wife and I got married one year ago today, on 11/27/2019. In honor of this very special day, I wanted to write a special short post showing that, in some sense, we&amp;rsquo;ve actually been married longer than one year.&lt;/p>
&lt;p>There are 365 days in a year (not including leap years). Of those 365, roughly 260 are weekdays. If not for the pandemic, we probably would have spent 3-4 waking hours together per weekday (0 in the morning and 3-4 after we got home from work/school). Of the remaining 105 non-weekdays, we might have spent 9-10 waking hours together. For one year of marriage, without considering small exceptions here and there, we&amp;rsquo;d thus expect roughly 10 * 105 + 4 * 260 = 2090 waking hours, or 87 waking days, spent together.&lt;/p>
&lt;p>Because of COVID-19, we&amp;rsquo;ve spent roughly 3/4 of our marriage quarantined in lockdown. Instead of 3-4 waking hours together on weekdays, we&amp;rsquo;ve been spending roughly 12-14 waking hours together per weekday during these unprecedented times. Assuming that we also spent a bit more time together on weekends, say 4 additional hours per day (totaling 13-14 hours), one year of marriage has produced approximately 1/4 * 2090 + 3/4 * (365 * 14) = 4355 waking hours (181 waking days) spent together. While the math is admittedly not entirely rigorous, this past year seems to have actually equated to over 2 years of waking marriage!&lt;/p>
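&lt;p>For the skeptical (or Python-inclined) reader, the back-of-the-envelope arithmetic above checks out in a few lines:&lt;/p>

```python
# Pre-pandemic estimate: 4 waking hours together on each of ~260 weekdays,
# 10 on each of the remaining ~105 days.
normal_hours = 4 * 260 + 10 * 105
normal_days = normal_hours / 24

# Pandemic-adjusted: ~1/4 of the year as normal, ~3/4 in lockdown at ~14 hours/day.
lockdown_hours = 0.25 * normal_hours + 0.75 * (365 * 14)
lockdown_days = lockdown_hours / 24

print(normal_hours, round(normal_days))      # 2090 87
print(lockdown_hours, round(lockdown_days))  # 4355.0 181
```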
&lt;p>This calculation obviously does not take &lt;em>everything&lt;/em> into account, but when I think about it, I realize how grateful I am for having spent all this time with my wife over the past year and look forward to many more happy, healthy years ahead. To Alexandra, happy 1 (2?) year anniversary; to everyone else, happy Thanksgiving!&lt;/p></description></item><item><title>Memorization as a caching mechanism</title><link>https://www.jgindi.me/posts/2017-08-26-memory/</link><pubDate>Sat, 26 Aug 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-08-26-memory/</guid><description>&lt;p>I was talking to my dad the other day about the advantages and disadvantages of memorizing things and we got into an interesting discussion, the essence of which I thought would make a great topic for a blog post.&lt;/p>
&lt;p>My argument against memorization has been (and will continue to be) that I don’t find topics that require lots of memorization that interesting. This is not to say that I don’t think memorization is important; on the contrary, I envy those who can remember the names of people and the most minute details of conversations they had. The things that I really enjoy learning, though, are the things that require a deeper understanding than that. I think that this need for fundamental understanding is one of the main things that led me to the math major as an undergrad.&lt;/p>
&lt;p>As our conversation continued and we talked about the way my dad studies music, we expressed differing opinions about the value of memorizing things. As a (now former) computer science student, I started to think of and bring up the “computational” benefits and drawbacks of each approach and ran into the all-too-familiar space/time tradeoff.&lt;/p>
&lt;p>On the one hand, being able to reproduce things from their essentials saves you brain space. You could, for example, remember the essence of a particular recipe without remembering the exact particulars, and then reproduce the recipe using just the main idea — the dish is a savory version of some dish you’ve made before with a garnish of potatoes.&lt;/p>
&lt;p>On the other hand, this on-the-fly relearning could (1) take time and (2) lead to mistakes. The other option is to memorize the recipe, item for item. With that approach, you are less prone to error and can put the recipe together much faster because you don’t have to reason through anything.&lt;/p>
&lt;p>I realized that to me, the primary benefit of memorization is its utility as a time-saver. When information comes in a small enough package, the time you spend memorizing it once might be well worth all the time you would otherwise spend reproducing it over and over later. (For those with some more technical background, I think of memorization as a caching mechanism of sorts.) The same goes for information you know you’ll need to reproduce frequently: memorizing it saves you time in the long run. On the other side, if the “information packet” is unwieldy and you know that you probably won’t need it that often, it might be worth “compressing” the data into its essential bits and reproducing it when you have to, so as to save yourself the headspace.&lt;/p>
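&lt;p>The caching analogy has a direct counterpart in code. Memoization trades memory (space) for recomputation (time), exactly the tradeoff described above; here is a minimal sketch using Python’s standard-library &lt;code>functools.lru_cache&lt;/code>:&lt;/p>

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Naive Fibonacci, memoized: each value is 'memorized' after its first
    computation instead of being re-derived from scratch every time."""
    if n in (0, 1):
        return n
    return fib(n - 1) + fib(n - 2)

# With the cache this is instant; without @lru_cache it would take
# exponentially many recursive calls.
print(fib(80))
```

&lt;p>Whether to cache (memorize) a given piece of information then becomes the same engineering question: how expensive is it to recompute, and how often will you need it?&lt;/p>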
&lt;p>I always appreciate when the things I study and think about at work make their ways into my daily life; this was a fun example.&lt;/p></description></item></channel></rss>