Wednesday, November 20, 2019

Introducing the Shakespeare Affinity Test

I developed the Shakespeare Affinity Test (SAT) out of a simple hypothetical. What if the First Folio contained only 35 plays, and a 36th play were then discovered? Could we devise an objective test, based on the other 35 plays, that would tell us, with some confidence, whether it was written by William Shakespeare?

The SAT succeeds quite well at that goal. Out of 300+ plays written before 1620, it places 27 of the 36 First Folio plays in the top 30 results. It also flags The Two Noble Kinsmen and Pericles -- two plays generally believed to be co-written by Shakespeare but not included in the First Folio -- as potentially written by Shakespeare.

I call it an "affinity" test rather than an "authorship" test because there is no known method for determining authorship with any degree of confidence. The Shakespeare canon may also be a unique case: his vocabulary was unusual, and the canon is unusually large. So the test may not generalize to determining authorship for other authors. Even for Shakespeare's plays, it is only a guide to authorship, not a definitive test.

That said, it is an objective, easily modifiable, and extensible rare word test. I hope it will find many uses. I will cover some of those below and in future blog posts.

Design Criteria for the Shakespeare Affinity Test (SAT)

I had a few key criteria for my test:
1. Objective. The test would need to be as straightforward and unbiased as possible.
2. Reproducible. Anyone should be able to reproduce the test and get the same results. It should also be easy to modify the test to see if the results are due to cherry-picked parameters or other biasing factors.
3. Valid. There should be good reason to think that the test actually works.

Fortunately, independent researcher Pervez Rizvi has developed a database of the full text of 500+ English-language plays from the Early Modern period. This corpus is free and publicly available; you can download it from his website. My test runs on this database completely unmodified, using standard SQL on MySQL, which is also available for free. Anyone can run the same test and get the same results.

Note, I am not going to spend time and energy critiquing other authorship tests done in the field of Shakespeare scholarship. But all tests must meet the above three criteria. These are the most basic standards of good scholarship.

General Theory Behind a Rare Word Test

Let me give an example of the theoretical motivation behind the test. The word “cicatrice” (a scar) occurs in four plays of Shakespeare. However, the word only occurs in one other play in the database of 500+ plays. So "cicatrice" serves as a marker for a Shakespeare play. If we found a previously unknown play from that period with the word “cicatrice” in it, that might be a hint that the play was written by Shakespeare.
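The marker-word idea can be sketched in a few lines of Python. This is only an illustration on a toy corpus: the play titles are placeholders, the word lists are invented, and the function names are my own. It simply mirrors the "four Shakespeare plays, one other play" pattern described above.

```python
from collections import Counter

# Toy corpus: placeholder titles, each mapped to (author, set of lemmas).
# Titles and word lists are invented for illustration only.
TOY_CORPUS = {
    "shakespeare_play_1": ("Shakespeare", {"cicatrice", "love", "king"}),
    "shakespeare_play_2": ("Shakespeare", {"cicatrice", "sword"}),
    "shakespeare_play_3": ("Shakespeare", {"cicatrice", "crown"}),
    "shakespeare_play_4": ("Shakespeare", {"cicatrice", "love"}),
    "shakespeare_play_5": ("Shakespeare", {"love", "crown"}),
    "other_play_1": ("Other", {"cicatrice", "king"}),
    "other_play_2": ("Other", {"love", "sword"}),
    "other_play_3": ("Other", {"king", "crown"}),
}


def plays_by_author(lemma, corpus):
    """Count, per author, how many plays contain the given lemma."""
    counts = Counter()
    for author, lemmas in corpus.values():
        if lemma in lemmas:
            counts[author] += 1
    return counts


def author_share(lemma, corpus, author):
    """Fraction of the plays containing the lemma that belong to `author`."""
    counts = plays_by_author(lemma, corpus)
    total = sum(counts.values())
    return counts[author] / total if total else 0.0


print(dict(plays_by_author("cicatrice", TOY_CORPUS)))
print(author_share("cicatrice", TOY_CORPUS, "Shakespeare"))
```

A word whose share is heavily tilted toward one author, as "cicatrice" is in this toy data, is the kind of word that can act as a marker.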

The beauty of my test is that it has an objective standard for what is a "rare word." It uses the prevalence of words in the entire database to determine which ones are rare. The characteristics of that definition can easily be changed to narrow the query. For instance, you could look for rare words during a certain time period, or look for unusual words that begin with the prefix "un-" or end with the suffix "-ate". I have done such tests -- with interesting results -- that I will be sharing in time.
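To make the idea of an adjustable definition concrete, here is a rough Python sketch. It assumes rarity is a simple cutoff on total occurrences in the database. The occurrence counts are invented, and the function name and parameters are hypothetical, not taken from the actual SQL.

```python
def rare_lemmas(total_counts, max_occurrences=20, prefix="", suffix=""):
    """Return lemmas at or below the cutoff, optionally filtered by affix.

    total_counts maps each lemma to its total occurrences in the corpus.
    Empty prefix/suffix strings match every lemma.
    """
    return {
        lemma
        for lemma, n in total_counts.items()
        if n <= max_occurrences
        and lemma.startswith(prefix)
        and lemma.endswith(suffix)
    }


# Invented occurrence counts, for illustration only.
TOY_COUNTS = {
    "the": 50000,
    "love": 9000,
    "unfold": 120,
    "captivate": 12,
    "cicatrice": 5,
    "unking": 3,
}

print(sorted(rare_lemmas(TOY_COUNTS)))
print(sorted(rare_lemmas(TOY_COUNTS, prefix="un")))
print(sorted(rare_lemmas(TOY_COUNTS, suffix="ate")))
```

Changing the cutoff or the affix filter changes the list of rare words without touching anything else, which is what makes the definition easy to experiment with.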

Once again, this is an "affinity" test, not an "authorship" test. I am not claiming it produces a reliable determination of authorship. It gives useful and objective information about word usage in early modern English plays. That information can be helpful in determining authorship.

How the Test Works

Pervez Rizvi's database lists the “lemmatized” form for each word in a play. This is the dictionary headword form; so for instance “buy,” “bought,” and “buying” would all be listed as “buy.” This simplification makes comparison straightforward. After all, we want “apple” and “apples” to count as the same word. The test runs on the lemmatized forms of words.
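As a small illustration of what lemmatization buys us, here is a Python sketch using a hand-written lookup table. The table and function names are hypothetical. The real database already supplies the lemma for every word, so nothing like this is needed to run the test.

```python
# Hand-written lookup from word form to lemma, for illustration only.
LEMMA_TABLE = {
    "buy": "buy",
    "bought": "buy",
    "buying": "buy",
    "apple": "apple",
    "apples": "apple",
}


def lemmatize(tokens, table):
    """Map each word form to its lemma; unknown words are lowercased as-is."""
    return [table.get(token.lower(), token.lower()) for token in tokens]


def unique_lemmas(tokens, table):
    """Return the set of distinct lemmas in a list of word forms."""
    return set(lemmatize(tokens, table))


words = ["Bought", "apples", "buying", "apple"]
print(lemmatize(words, LEMMA_TABLE))
print(sorted(unique_lemmas(words, LEMMA_TABLE)))
```

Four different word forms collapse to two lemmas, so a play is credited once for "buy" and once for "apple" however the words are inflected.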

I will now describe the test in an algorithmic form. This is the procedure used to produce the results.

1. Select the lemmas in the database that occur 20 times or fewer. (Note: this counts total occurrences across the database, not the number of plays a lemma occurs in.) Let's call this list RARE_LEMMAS. These are words that objectively aren't very common in early modern English plays. (Note: 20 is an arbitrary number, and I encourage people to play with different parameters and compare results.)

2. Select the lemmas that occur in Shakespeare's First Folio plays. For Henry VIII, only use the Shakespeare section, not the Fletcher section. Pervez Rizvi's database has this division built in. Let's call this list SHAKESPEARE_LEMMAS.

3. Select only the words in the RARE_LEMMAS list that also occur in the SHAKESPEARE_LEMMAS list. This gives you RARE_SHAKESPEARE_LEMMAS.

4. Count how many of the RARE_SHAKESPEARE_LEMMAS occur in each play. Count each lemma only once. This gives you the RAW_SCORE.

5. For First Folio plays, you run the same test, but first create a special SHAKESPEARE_LEMMAS list from only the other 35 First Folio plays. Using that, you create a RARE_SHAKESPEARE_LEMMAS list to compare with the play and generate a RAW_SCORE.
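The five steps above can be sketched in Python as follows. This is a minimal sketch on a toy corpus, not the SQL actually used: the play titles and word lists are invented, the function names are hypothetical, the cutoff is lowered to 2 so the tiny example produces something, and the Henry VIII section split is not modeled.

```python
from collections import Counter


def total_counts(plays):
    """Total occurrences of each lemma across every play in the corpus."""
    counts = Counter()
    for tokens in plays.values():
        counts.update(tokens)
    return counts


def raw_score(target, plays, folio_titles, max_occurrences=20):
    """RAW_SCORE for one play, following steps 1 to 5."""
    # Step 1: RARE_LEMMAS, by total occurrences in the whole corpus.
    counts = total_counts(plays)
    rare = {lemma for lemma, n in counts.items() if n <= max_occurrences}
    # Steps 2 and 5: SHAKESPEARE_LEMMAS, leaving out the target play itself
    # when the target is one of the First Folio plays.
    shakespeare = set()
    for title in folio_titles:
        if title != target:
            shakespeare.update(plays[title])
    # Step 3: RARE_SHAKESPEARE_LEMMAS.
    rare_shakespeare = rare & shakespeare
    # Step 4: count each matching lemma once.
    return len(rare_shakespeare & set(plays[target]))


# Toy corpus: placeholder titles mapped to lists of lemmas (invented data).
TOY_PLAYS = {
    "folio_a": ["love", "cicatrice", "king"],
    "folio_b": ["love", "king", "unking"],
    "folio_c": ["love", "cicatrice", "sword"],
    "other_x": ["love", "king", "unking", "zany"],
    "other_y": ["love", "king", "sword", "love"],
}
TOY_FOLIO = ["folio_a", "folio_b", "folio_c"]

for title in TOY_PLAYS:
    print(title, raw_score(title, TOY_PLAYS, TOY_FOLIO, max_occurrences=2))
```

Note how folio_b scores 0 here: its only rare word, "unking", appears in no other Folio play, so the leave-one-out rule in step 5 keeps a play from vouching for itself.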

The RAW_SCORE isn't scaled in any way for the length of the play. The number is useful, but it provides a somewhat skewed result, since a longer play has more chances to include a rare word. To scale it, you divide the RAW_SCORE by the total number of unique lemmas in each play. This number, NUM_TOKENS, is included in the database for each play. This produces a SCALED_SCORE. Multiply this number by 1000 to get the FINAL_SCORE.
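The scaling arithmetic can be written as a short Python function. The input numbers below are invented for illustration, and rounding to two decimals is my own assumption, chosen to match how the scores are quoted in this post.

```python
def final_score(raw_score, num_tokens):
    """Scale a RAW_SCORE by NUM_TOKENS and multiply by 1000."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    scaled_score = raw_score / num_tokens  # SCALED_SCORE
    return round(scaled_score * 1000, 2)   # FINAL_SCORE


# Two hypothetical plays with the same RAW_SCORE but different lengths.
print(final_score(50, 2000))
print(final_score(50, 4000))
```

The same RAW_SCORE gives half the FINAL_SCORE when NUM_TOKENS doubles, which is the length correction the scaling is meant to provide.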

Once again, I encourage people to modify the parameters as they see fit. But the simple fact is, the test works. It can identify plays written by Shakespeare, especially later plays, reasonably well. It works less well for very early plays, and I will discuss this issue in a later blog post.

Here are the rough results, out of approximately 300 plays written before 1620. Shakespeare's plays are very strongly grouped toward the top of the list. You should be able to click on each image to read it clearly:



So, for instance, The Tempest scores 39.69 while Jonson's The Alchemist scores only 27.25. Hamlet scores highest with 67.70. Interestingly, Beaumont and Fletcher's A King and No King scores extremely low with 12.07. Over time I will release more detailed data along with the SQL code. This blog post is just designed to give people a general idea of how the test works.
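For anyone who wants to work with the scores as a ranked list, here is a small Python sketch that uses only the four FINAL_SCORE values quoted above. The function names are hypothetical, and four plays are of course far too few to say anything about the full ranking.

```python
def rank_plays(final_scores):
    """Return play titles sorted from highest FINAL_SCORE to lowest."""
    return sorted(final_scores, key=final_scores.get, reverse=True)


def count_in_top(ranked_titles, titles_of_interest, n):
    """Count how many of the given titles land in the top n of a ranking."""
    return sum(1 for title in ranked_titles[:n] if title in titles_of_interest)


# The four scores quoted in this post.
SCORES = {
    "The Tempest": 39.69,
    "The Alchemist": 27.25,
    "Hamlet": 67.70,
    "A King and No King": 12.07,
}

ranked = rank_plays(SCORES)
print(ranked)
print(count_in_top(ranked, {"Hamlet", "The Tempest"}, 2))
```

Run over the full list of scores, the same kind of count is what gives a figure like "how many First Folio plays fall in the top 30."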

Other Uses for the Test

The test has many uses. It can be used to help divide up plays based on co-authorship or determine whether a play is co-authored. It also can help identify when specific words enter the Shakespeare canon and how unusual they were at the time. The high scores for What You Will and The Devil's Charter should spur further research, and I will have more to say on this later.