
Building a Pitchfork text generator with GPT-2

Introduction

by Rob Arcand

Of the many risks that Pitchfork Media took in its early years, publishing freely-accessible, high-volume criticism may have been the most impactful. The self-proclaimed “most trusted voice in music” became the first truly essential destination for online music criticism, outlasting nearly every zine and website of the 1990s and 2000s through its sheer commitment to daily publishing. Unlike legacy music magazines, which still saw the web as a liability for the copyrighted creations they’d spent decades institutionalizing, Pitchfork kicked web publishing into overdrive, industrializing the form and function of a magazine at a scale once unheard of in the media industries. Posting four album reviews and numerous news stories each day, the site developed a devout and impassioned audience as it quickly transcended its underdog status to become the kind of proper institution its snarky writers once criticized.

Pitchfork’s reviews in turn have developed an appeal to two different kinds of programmers for two distinct reasons. For the hobbyist, a language model trained on Pitchfork data has become a de facto “Hello, world” for testing new algorithms to see how well a machine can mimic the site’s trademark affectations. For commercial and academic data scientists, the uniformity of the reviews’ language has made them a popular, standardized corpus. In his doctoral work at the MIT Media Lab, the computer scientist Brian Whitman used data taken from Pitchfork reviews (among other music review sites) to train a linguistic model that would eventually become a core part of the Echo Nest, a music intelligence platform that was bought by Spotify in 2014 as part of their growing investment in music analytics. But does the review itself provide any real insight into the mechanics of the music it’s describing? Or is it now merely secondary to the formal logic of the publication, an industrialized medium to fill an economic void for the site?

OpenAI, the San Francisco-based artificial intelligence company co-founded by Elon Musk, has developed a series of incredibly impressive natural language models for generating text. These Generative Pre-trained Transformer (GPT) models use a broad, content-agnostic corpus taken from around the internet that’s refined through a process of “discriminative fine-tuning” specific to the model’s intended use-case. The approach has proven to be shockingly adept at summarizing news articles, answering reading comprehension questions, and writing entire passages with the uncanny feeling of human authorship. Earlier this summer, the company made their GPT-2 model (a leaner version of the model that preceded the recently released sensation that is GPT-3) available to the public through a general-purpose API, which allows users to fine-tune the model in accordance with their own needs and receive a text-based response consistent with their training set.

Testing a GPT-2 model trained on Pitchfork for the first time, it’s clear that it has a foundational grasp of grammar and syntax, combining words and phrases in ways consistent with higher-order meaning. Where other models like the ubiquitous Markov chain struggle to move beyond simple grammar to structure sentences with much complexity, GPT-2 captures the implicit tonal and syntactic rhythm of the review at the scale of the sentence, crafting lines that feel like otherworldly collages of the original Pitchfork corpus. References to very real outside artists sit alongside completely fictional track titles and album names, as sentences achieve a kind of structural continuity that feels characteristic of many latent space mathematical models. Strange redundancies reveal a general disregard for ordered logic amid what appears to be a broader commitment to meaning, as the model returns to the same ideas again and again with similar phrases. A sentence like “This is a field-recorded track, and it's got a nice, gothic quality to it, but it's also got a nice, gothic feel to it...” might feel like overkill until you realize that the model might effectively be drawing the same conclusion twice, doubling down on the corpus’ most concrete characteristics in a way that only appears cyclical on the surface.

Switching between genres, certain patterns in word pairings—“Houston-based producer,” “avant metal trio,” “New York-disco style”—start to emerge, and the model shows a striking ability to locate consistencies in the ways writers discuss genre. “Pop-rock roots” align with Pitchfork’s Pop/R&B category, while the Folk/Country genre accurately identifies the regional inflection of a fictional track title like “I Ain’t Marching Anymore.” At the same time, adjustments made to the score parameter don’t appear to affect the kind of hollow positivity of each response, suggesting a certain ambiguity in the connection between numerical score and emotional valence that the model can’t quite identify. For a site so committed to quantification, you’d think this connection would be more evident.

Of course, Pitchfork itself, constituted by real humans whose experiences are brought into their writing one way or another, is subject to change in a way an eternally backward-looking model isn’t. What the model captures is not necessarily an institution as much as an aggregate of many periods of an institution, every version except for those that might yet come and surprise.

Tutorial

By Jules Becker

The following documentation details how to fine-tune GPT-2 on a text dataset—in this case, Pitchfork music reviews—and create a simple API to serve generated text from a database.

1. Data prep

To start, download the Pitchfork review dataset from Components. In order to fine-tune GPT-2 on text data, it needs to be in a single-column CSV, with one example per row. The first step, then, is getting the reviews in this format.

We could just fine-tune GPT-2 on the raw Pitchfork review text, but it would be more interesting to incorporate some of the metadata we have alongside each review. In particular, adding the score and genre to the text of each review makes it possible to prompt the GPT-2 model to generate review text for a specific score and genre later on. This concept of “internal metadata” is a common way to fine-tune GPT-2 on text from a range of authors, categories, or styles, and then conditionally generate text from each one. Accordingly, each row in the single-column CSV should be in this format for fine-tuning:

"8.4 
Electronic,Experimental
If you know the true identity of London dubstep artist Burial, consider yourself a member of a very exclusive circle. Steve Goodman, who runs London's Hyperdub label, knows..."

Then, you’ll be able to prompt the model with text like “4.5\nRock” and the model will complete the review for a rock album with a score of 4.5.
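As a rough sketch of this format, the two helpers below build one training row and the matching prompt. The names format_example and make_prompt are hypothetical and not part of gpt-2-simple, and the review text is a placeholder. The prompt ends with a newline so that the model starts writing on the review line, which matches the prefix used in the generation script later in this tutorial.

```python
def format_example(score, genre, review):
    """Join score, genre, and review text into one newline-separated example."""
    return f"{score}\n{genre}\n{review}"


def make_prompt(score, genre=None):
    """Build a prompt that stops after the score, or after the score and genre."""
    if genre is None:
        return f"{score}\n"
    return f"{score}\n{genre}\n"


# Placeholder review text, only to show the shape of a row.
example = format_example(8.4, "Electronic,Experimental", "Review text goes here.")
prompt = make_prompt(4.5, "Rock")

print(repr(example))
print(repr(prompt))
```

Because every training row starts with the same score-then-genre layout, a prompt built this way is always a valid beginning of a row.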

The pandas library is helpful for getting the data into this format. First, open up a command line and install pandas to use in Python (sqlite3 ships with Python's standard library, so it needs no separate install):

pip install pandas

This also provides a good opportunity to do some basic data cleaning and filtering. Along with exporting the CSV, the Python code below filters out reviews that are very long/short or contain extra newlines. In particular, it’s important that none of your text examples are too long (more than 10,000 characters, for example), because otherwise you’ll get a “field larger than field limit” error loading the CSV later. The exported file shouldn’t have row or column names—just the score, genre, and text for each review—so df.to_csv() should be called with header=False and index=False.

import sqlite3 
import pandas as pd

db = sqlite3.connect('pitchfork.db')
df = pd.read_sql_query("SELECT genre, score, review FROM reviews", db)  

df = df[df.review.str.len() > 20]
df = df[df.review.str.len() < 10000]
df = df[~df.review.str.contains('\n')]

single_column = df.score.map(str) + "\n" + df.genre + "\n" + df.review  
single_column.to_csv('score_genre_review.csv', header=False, index=False)
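Before moving on, it can be worth checking that the exported file reads back cleanly. The sketch below assumes the file will later be read with Python's csv module, which is where the “field larger than field limit” message comes from. The function name summarize_csv is hypothetical.

```python
import csv


def summarize_csv(csv_path):
    """Return (row_count, longest_field_length, csv_field_limit) for a CSV file.

    csv.reader raises csv.Error if any field exceeds csv.field_size_limit(),
    so a file that is read without an exception is within the limit.
    """
    rows = 0
    longest = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            rows += 1
            longest = max([longest] + [len(field) for field in row])
    # Called with no argument, field_size_limit() returns the current limit.
    return rows, longest, csv.field_size_limit()
```

Calling summarize_csv('score_genre_review.csv') on the exported file should report one row per review and a longest field a little over 10,000 characters at most, since each row adds the score and genre lines to the filtered review text.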

2. Set up your GPU environment

Once you have the text CSV ready to go, you’ll need a GPU environment to train the language model. If you don't have your own GPU environment set up already, both Paperspace Gradient and Google Colab offer free GPU services that work well for this kind of thing. I prefer Paperspace because it offers a permanent filesystem; in Colab you have to copy files in and out of Google Drive, so you can lose training checkpoints if you’re not careful. Also, in Paperspace you don’t have to keep your browser window open for your code to run.

Importantly, the Paperspace Jupyter environment lets you open up a dedicated terminal window, which is helpful for running shell commands and launching Python scripts. You should run the training and generation commands in the rest of this tutorial using this terminal.

Once you’ve logged into Paperspace Gradient, choose “Run a sample notebook,” select the Tensorflow 1.14 container, and open a new free GPU notebook using either the “Free-GPU” or “Free-P5000” options. After waiting for the instance to be provisioned, open the notebook, and at the top of the Jupyter window, click New->Terminal.

Next, install Max Woolf’s gpt-2-simple library via the command line:

pip install gpt-2-simple

Max also has a more recent project, aitextgen, that uses the HuggingFace Transformers library. However, you can currently only finetune the smallest GPT-2 model (124M parameters) with aitextgen, while gpt-2-simple lets us train the larger 355M parameter model.

Once gpt-2-simple is installed, transfer your text CSV onto the remote machine. Since this is a large file (~100MB), you can use gdown to download it from your Google Drive on the command line if you don’t want to upload it directly. Make sure to save this to the permanent /storage directory in Paperspace. This is also a good time to create dedicated subdirectories for your model and training checkpoints; I named them “/storage/model/” and “/storage/checkpoint/”.

3. Fine-tune GPT-2 on your text

Once your text CSV and storage directories are in place, use the gpt_2_simple command line interface to begin fine-tuning the model. The command below will load the default 355M parameter GPT-2 model and then fine-tune it for 1,000 steps on the Pitchfork text.

gpt_2_simple finetune --run_name 'pitchfork_run1' \
    --dataset '/storage/score_genre_review.csv' \
    --checkpoint_dir '/storage/checkpoint' \
    --model_dir '/storage/model' \
    --model_name '355M' \
    --restore_from 'latest' \
    --steps 1000 \
    --print_every 20 \
    --save_every 500

This should take around 25 minutes in the Free-P5000 environment. If you plan on running this command multiple times, I recommend copying the whole thing into a shell script named something like "finetune.sh" so you can easily change parameters and run it again (in the Paperspace Jupyter terminal I had to run chmod +x finetune.sh to make the file executable first). The --restore_from 'latest' argument will ensure that it always picks up at the last saved model checkpoint. There are some other interesting parameters you can add to the fine-tuning command: “--sample_every 100” will generate a text sample every 100 steps. If you want to use a smaller model, which will sacrifice some accuracy for training and generation speed, you can change ‘355M’ to ‘124M.’

An important question here is how many steps to fine-tune the model for. This is very much an open question, but the answer generally depends on how large and complex your text data is. For fine-tuning GPT-2 on a few hundred tweets, 200-500 steps would likely suffice. However, this Pitchfork dataset contains a lot of long, fairly detailed text documents, so at least a few thousand steps is a good target. In my experience, the model’s output continued to increase in quality over tens of thousands of steps, although it started generating convincing Pitchfork reviews after only a couple thousand. Oftentimes just reading through generated examples is the best way to tell when to stop fine-tuning; you can also watch the training loss to see when that stops decreasing. One thing to watch out for here is overfitting: if you fine-tune the model for too many steps, it will start to copy phrases verbatim from the training data. Occasionally searching your text dataset for phrases that your model is generating can help tell you when you've trained for too long and the model has begun to overfit.
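One way to make the verbatim-copy check less manual is to measure how many of a generated review's word sequences also occur in the training text. This is a minimal sketch rather than an established overfitting metric: the function names and the choice of 8-word phrases are assumptions. It also holds every training phrase in memory, so for the full dataset you may want to run it against a sample of reviews.

```python
def word_ngrams(text, n):
    """Return the set of all n-word phrases in text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(generated, training_texts, n=8):
    """Fraction of the generated text's n-word phrases found verbatim in training."""
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0  # generated text is shorter than n words
    training_ngrams = set()
    for text in training_texts:
        training_ngrams |= word_ngrams(text, n)
    copied = generated_ngrams & training_ngrams
    return len(copied) / len(generated_ngrams)
```

A fraction that climbs from one checkpoint to the next suggests the model is starting to memorize the reviews rather than imitate them.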

4. Generate some text

Now that the model is fine-tuned, you can use the gpt_2_simple CLI to generate some text. If you want to experiment quickly with different generation parameters, I recommend creating a Jupyter notebook in Paperspace and using the gpt2.generate(sess) function outlined here so you only have to load the model once to try out different combinations or prompts. However, once you’re confident with generation parameters and you want to automatically dump a bunch of generated texts into a text file, use this command:

gpt_2_simple generate --run_name 'pitchfork_run1' \
    --checkpoint_dir '/storage/checkpoint' \
    --nsamples 200 \
    --batch_size 10 \
    --prefix '<|startoftext|>' \
    --truncate '<|endoftext|>' \
    --include_prefix False \
    --top_p 0.95

If you want to try conditionally generating reviews for a specific score or score/genre combo, change the --prefix parameter to <|startoftext|>6.7 or <|startoftext|>8.2\nElectronic (for example).

Like before, it’s handy to save this generation command to a file like “generate.sh” so you can run it multiple times. Most of the generation parameters are fairly self-explanatory (or explained in the code), but one of particular note is top_p. Setting this to a value between 0 and 1 applies nucleus sampling, which restricts the language model’s choice of words to a smaller “nucleus” of the most likely words. You can read more about how this works here, but generally it should result in more realistic text with less repetition (a common idiosyncrasy of language models like GPT-2). Also, higher values for --temperature above the default of 0.7 will result in the model's predictions getting more random. Play around with different parameters and values!
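To make these two parameters more concrete, here is a small pure-Python illustration of the general techniques. It is not gpt-2-simple's own implementation, which runs inside TensorFlow, and the function names and toy numbers are made up for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p.

    Returns a dict mapping each kept token index to its renormalized probability.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus(toy_probs, 0.9))                 # the least likely token is dropped
print(apply_temperature([2.0, 1.0, 0.0], 0.7))  # sharper distribution
print(apply_temperature([2.0, 1.0, 0.0], 1.5))  # flatter, more random
```

Lower temperatures concentrate probability on the top choice, while a lower top_p shrinks the set of words the model is allowed to pick from at all.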

5. Set up a database and generation pipeline

To deploy your fine-tuned GPT-2 model online, there are a few possibilities. First, you could deploy it on GPUs using something like Google Cloud Run and have it generate text in real time as users visit your app. Users could enter arbitrary prompts for the model to complete. This could get very costly, though, because generating even just one text utilizes a full GPU for 5-10 seconds. Imagine doing this constantly, with a bunch of concurrent requests! This will get pricey even if you figure out some good batching strategies and other cost optimizations–even the phenomenal Talk to Transformer became financially untenable.

A second option is to use a single GPU (like the free ones on Paperspace or Colab) to generate large batches of examples, and store them in a database to later recall on demand when a user wants to view a newly generated review. The downside here is that users can’t specify custom prompts–or we at least have to limit the potential prompts to some pre-selected options. In this particular case, we can pre-generate Pitchfork reviews for different combinations of scores and genres and let users select the particular score and genre they want to see a review for.

To store the documents, we can use a cloud database like Google Cloud Firestore, which provides a NoSQL database with a free usage tier. Firestore's built-in indexing will make querying for reviews of a specific score and genre very fast. The fact that it’s in the cloud, with a Python API, also makes it easy to add documents to the database from Paperspace GPU and then access them later from the web server (or wherever else).

To generate reviews with your model and add them to the database, create a new Python script in the Paperspace instance (at the top of the Jupyter window, go to New->Text File and then save it with a .py extension). First, import the relevant packages, load the model, and connect to the database (this assumes you’ve created a collection in Firestore called “reviews”):

import os 
import gpt_2_simple as gpt2 
from google.cloud import firestore  

# replace './key.json' with the path to your Google Cloud key 
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = './key.json'  

sess = gpt2.start_tf_sess() 
gpt2.load_gpt2(sess, checkpoint_dir='/storage/checkpoint', run_name='pitchfork_run1')

db = firestore.Client() 
reviews_ref = db.collection("reviews")

Next, define all the (score, genre) pairs that you want to generate reviews for. In this case, we'll only generate reviews for scores that are multiples of 0.5:

scores = [x / 2 for x in range(21)] 
genres = ['Rock', 'Electronic', 'Rap', 'Pop/R&B', 'Experimental', 'Folk/Country', 'Metal', 'Jazz'] 
pairs = [(i, j) for i in scores for j in genres]
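A quick count shows how large the pre-generated pool gets. The snippet repeats the lists above so that it runs on its own, and it assumes 50 samples per pair, the nsamples value used in the generation call further down. The total is an upper bound, because the clean-up step discards some samples.

```python
scores = [x / 2 for x in range(21)]  # 0.0, 0.5, ..., 10.0
genres = ['Rock', 'Electronic', 'Rap', 'Pop/R&B', 'Experimental',
          'Folk/Country', 'Metal', 'Jazz']
pairs = [(i, j) for i in scores for j in genres]

samples_per_pair = 50  # assumed to match nsamples in the generation call
total_reviews = len(pairs) * samples_per_pair

print(len(pairs))     # 168 (score, genre) combinations
print(total_reviews)  # at most 8400 reviews in the database
```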

Although GPT-2 largely outputs properly formatted text, you can add a few simple text processing steps to remove extra start-of-text tokens and make sure the review doesn’t end mid-sentence. Also, you can trim off the lines containing the score and genre and store that metadata separately. Here’s a function for processing each review accordingly:

def process(review):   
    # remove everything before the last <|startoftext|>   
    review = review.split('<|startoftext|>')[-1]      
    
    # and also before the last newline (this will get rid of the prompt)
    review = review.split('\n')[-1]     
    
    # throw out reviews shorter than a sentence   
    if review.count('.') == 0:     
        return None   
    else:    
        # trim after the last period     
        review = review.rsplit('.', 1)[0] + '.'            
        
        return review
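To see what these steps do, the snippet below applies the same three string operations, one at a time, to a made-up raw sample containing a repeated start token and an unfinished final sentence. The sample text is invented for illustration.

```python
raw = ("<|startoftext|>7.5\nRock\n"
       "<|startoftext|>7.5\nRock\n"
       "The band returns with a louder record. It mostly works. The closing track is a")

# 1. Drop everything before the last start-of-text token.
text = raw.split('<|startoftext|>')[-1]

# 2. Keep only the last line, which drops the score and genre lines.
text = text.split('\n')[-1]

# 3. Cut the unfinished sentence after the last period.
text = text.rsplit('.', 1)[0] + '.'

print(text)  # The band returns with a louder record. It mostly works.
```

Because the function only looks for a period character, abbreviations and decimals inside the review also count as sentence endings, so an occasional review will end at an odd spot.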

Now, iterate through each of the (score, genre) pairs and generate reviews for each:

for score, genre in pairs:   
    review_prefix = '<|startoftext|>' + str(score) + '\n' + genre + '\n'  
    text = gpt2.generate(sess,       
            run_name='pitchfork_run1',   
            checkpoint_dir='/storage/checkpoint',       
            prefix=review_prefix,       
            truncate='<|endoftext|>',       
            return_as_list=True,       
            include_prefix=False,       
            nsamples=50,       
            batch_size=10,       
            length=500,       
            temperature=0.7,       
            top_p=0.95       
            )    
        
    # run process() once per sample and drop the None values it returns
    processed = [p for p in (process(r) for r in text) if p]

Still within this for loop, add each batch of reviews to the database. The selection_count attribute will be used when you eventually deploy the app to keep track of how many times a particular review has been shown to someone.

for score, genre in pairs: # from above   
    # ... generation code   
    batch = db.batch()    
    
    for review in processed:     
        data = {      
            "text": review,       
            "score": score,       
            "genre": genre,       
            "length": len(review.split()),       
            "added": firestore.SERVER_TIMESTAMP,       
            "selection_count": 0     
        }      
        
        # create a ref with auto-generated ID     
        new_review_ref = reviews_ref.document()      

        # add it to the batch     
        batch.set(new_review_ref, data)    
    
    batch.commit()

You can also add some more values to the review object, like the generation temperature or model version you used. This will allow you to query based on that information later on (“only give me reviews generated with a temperature of 0.9”).
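As a sketch of what that could look like, the helper below builds the same dictionary with two extra fields. The function name build_review_doc and the field names temperature and model_version are assumptions, not required by Firestore. The "added" timestamp is left out so that the snippet runs without a Firestore connection; in the real script you would add it back in.

```python
def build_review_doc(text, score, genre, temperature, model_version):
    """Build the review document, including generation settings as extra fields."""
    return {
        "text": text,
        "score": score,
        "genre": genre,
        "length": len(text.split()),
        "selection_count": 0,
        # extra fields (hypothetical names) that can be filtered on later
        "temperature": temperature,
        "model_version": model_version,
    }


doc = build_review_doc("A fine record.", 7.5, "Rock", 0.9, "pitchfork_run1")
print(doc)
```

With these fields stored, a filter such as where('temperature', '==', 0.9) can be chained onto the query in the same way as the score and genre filters.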

6. Serve the generated text via an API

The final step here is to put a simple API in front of the database queries, so that a GET request to a URL like <yoururl>/review/ returns a new review. Users can ask for particular scores or genres by adding query parameters to the URL, like <yoururl>/review/?score=8.0&genre=Rap. It’d also be nice to have permalinks so people can share generated reviews, so a GET request to /review/<id> should return the review with that ID. Here’s a simple way to implement this server in Flask; you’ll need to set the Google Cloud key environment variable for this to work (you can do this in Python with os.environ like above, or via the command line).

from google.cloud import firestore 
from flask import Flask, jsonify, abort, request 
import os  

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"  

# database connection
db = firestore.Client() 
pitchfork_ref = db.collection("reviews")

app = Flask(__name__)  

@app.errorhandler(404) 
def resource_not_found(e):   
    return jsonify(error=str(e)), 404  

@app.route('/review/', defaults={'id': None})
@app.route('/review/<id>') 
def get_review(id):   
    if id is None:     
        score = request.args.get('score')     
        genre = request.args.get('genre')      
        
        query = pitchfork_ref      
        
        if score is not None:       
            query = query.where('score', '==', float(score))            
        if genre is not None:       
            query = query.where('genre', '==', genre)      
            
        query = query.order_by("selection_count").limit(1).stream()
        result = None
        
        for doc in query:       
            id = doc.id       
            result = doc.to_dict()       
            result['id'] = id       
            break      
            
        # if query was empty, result will still be None     
        if result is None:       
            abort(404, description="incorrect query")     
        else:       
            pitchfork_ref.document(id).update(      
                {"selection_count": firestore.Increment(1)})      
            return jsonify(result)    
    else:     
        doc = pitchfork_ref.document(id).get()      
        
        if doc.exists:      
            result = doc.to_dict()       
            result['id'] = id       
            return jsonify(result)     
        else:       
            abort(404, description="id not found")

The code within the if id is None: block looks to see if a score or genre was specified by a query parameter, and then restricts the Firestore query to only those reviews with that score and/or genre. query.order_by("selection_count") chooses the least-seen review, as selection_count is updated with pitchfork_ref.document(id).update({"selection_count": firestore.Increment(1)}) whenever a review is returned. If an id is specified, the server queries for that specific review without incrementing the selection_count.
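On the client side, one detail to watch is that genre names such as Pop/R&B contain characters with special meaning in a URL: an unescaped & would split the genre into two query parameters. Here is a minimal sketch of building a request URL with the standard library. The function name review_url and the base URL http://localhost:5000 are placeholders for wherever the API ends up running.

```python
from urllib.parse import urlencode


def review_url(base_url, score=None, genre=None):
    """Build the /review/ URL with optional, properly escaped query parameters."""
    params = {}
    if score is not None:
        params["score"] = score
    if genre is not None:
        params["genre"] = genre
    query = urlencode(params)
    return f"{base_url}/review/" + (f"?{query}" if query else "")


# Placeholder base URL for a locally running server.
print(review_url("http://localhost:5000", score=8.0, genre="Rap"))
print(review_url("http://localhost:5000", genre="Pop/R&B"))
```

Flask decodes the escaped value again when the request arrives, so request.args.get('genre') on the server side receives the original Pop/R&B string.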

To deploy this API you could use something like Google App Engine, which like Firestore has a free tier and will scale as needed. Alternatively, you could just run it on your own VPS and avoid the possibility of surpassing the free tier as your API gets more usage.

View a standalone version of this generator here, as well as a discussion of a similar model fine-tuned on contemporary art reviews here.