Showing posts with label Natural Language Processing. Show all posts

Wednesday, November 20, 2019

Analyzing Co-Authorship in Timon of Athens using the Shakespeare Affinity Test

It has been suggested that Shakespeare's play Timon of Athens might have been co-authored with Thomas Middleton. Here I will use some techniques made possible by my Shakespeare Affinity Test to investigate this question.

Plotting the Unusual Shakespeare Words

As I explained in my post introducing it, the Shakespeare Affinity Test identifies words that occur in Shakespeare's plays but are relatively unusual in the rest of the database of 500+ plays. By charting where these words occur in a play possibly co-authored by Shakespeare, it might be possible to determine which sections were written by Shakespeare and which were written by someone else.

Note that the Shakespeare Affinity Test is not an authorship test. There is currently no reliable method for determining authorship. The SAT is just a tool for analyzing the unusual vocabulary in a play.

I ran Timon of Athens through the SAT with its standard parameters to create this chart, which plots these "hits" by word position:


The chart shows several gaps where rare Shakespeare words are sparse:

Gap 1: Words 797-1799. Roughly Act 1, Scene 1, Lines 119-275.
Gap 2: Words 8945-10325. Essentially all of Act 3, Scenes 5 and 6.
Gap 3: Words 14754-16259. Most of Act 5, Scene 1.
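
To make the idea of a "gap" concrete, here is a minimal Python sketch of one way to find stretches of a play with no hits. It is only an illustration under my own assumptions: the function name `find_gaps`, the `min_gap` threshold, and the toy positions are all hypothetical, and it treats a gap as a run with no hits at all, which is stricter than the "fewer hits" reading used above.

```python
def find_gaps(hit_positions, total_words, min_gap):
    """Return (start, end) word-position ranges of at least min_gap words
    that contain no rare-word hits."""
    gaps = []
    previous = 0  # treat the start of the play as a boundary
    # Append the end of the play so a trailing gap is also detected.
    for position in sorted(hit_positions) + [total_words]:
        if position - previous >= min_gap:
            gaps.append((previous, position))
        previous = position
    return gaps


# Toy data: hypothetical hit positions in a 100-word text.
hits = [3, 10, 14, 60, 65, 70, 98]
print(find_gaps(hits, total_words=100, min_gap=25))
# prints [(14, 60), (70, 98)]
```

A real run would use the hit positions from the chart and a threshold chosen by eye or by experiment. Different thresholds will give different gaps.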

Compare this with Wikipedia's summary of John Jowett's analysis:

"John Jowett, editor of the play for both the Oxford Shakespeare: Complete Works and the individual Oxford Shakespeare edition, believes Middleton worked with Shakespeare in an understudy capacity and wrote scenes 2 (1.2 in editions which divide the play into acts), 5 (3.1), 6 (3.2), 7 (3.3), 8 (3.4), 9 (3.5), 10 (3.6) and the last eighty lines of 14 (4.3)."

The only agreement between the above chart and this analysis seems to be on Act 3, Scenes 5 and 6.

Analyzing Unusual Words in the Gaps

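I ran a rare word test on just the gaps. The words below occur in the gaps, appear in no other Shakespeare play, and occur 20 or fewer times in the whole database. Each word is followed by its word position in Timon of Athens and then by the other plays in which it appears: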

unclew, 1354 -- unique
repugnancy, 9067 -- The Duchess of Suffolk
byzantium, 9181 -- Selimus, Hans Beer-Pot
briber, 9185 -- unique
rioter, 9235 -- Mucedorus; Middleton, A Trick to Catch the Old One (twice) and Michaelmas Term; Yorkshire Tragedy
usure, 9579 -- common word
exceptless, 15162 -- unique
usure, 15260 -- common word
phrynia, 15531 -- unique
timandra, 15533 -- The Bondman, Philip Massinger (1624)
opulency, 15798 -- unique

If Middleton wrote these sections, you might expect there to be words not found in other Shakespeare plays that Middleton commonly used. However, there is little evidence here for Middleton's authorship of these passages. The only word of interest is "rioter" in Act 3, Scene 5 of the play.

Analyzing Unique Words in Timon of Athens

Using the database, we can easily generate a list of all of the words that appear in Timon of Athens but in no other Shakespeare First Folio plays. There are 91 such words:

alcibiades, apemantus, approacher, ardent, balsam, banditti, blain, brevis, briber, byzantium, caphis, carper, castigate, cauterize, composture, confectionary, contentless, decimation, defiler, detention, distasteful, dividant, droplet, embalm, ensear, exceptless, exhaust, flaminius, fragile, furor, gluttonous, hortensius, indisposition, insculpture, ira, isidore, jutting, lacedaemon, liquorish, lucullus, madwoman, mangy, manslaughter, misanthropos, monstrousness, mountant, numberless, nutriment, oathable, obliquy, opulency, passive, pelf, penurious, philotus, phrynia, procreation, rampire, recanter, recoverable, regardful, regular, reliance, repugnancy, rioter, sacrificial, servilius, slavelike, softness, solidar, spilth, steepy, straggle, suitable, timandra, towardly, trenchant, unagreeable, unaptness, uncharged, unclew, unctuous, unpeaceable, untirable, usure, varro's, viced, wappened, whittle
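
A list like this is essentially a set difference. The Python sketch below shows the idea on toy data. The function name `words_unique_to_play` and the tiny lemma lists are hypothetical stand-ins; the actual list came from a query against the database, not from this code.

```python
def words_unique_to_play(lemmas_by_play, target):
    """Return the lemmas that occur in `target` but in no other play given."""
    others = set()
    for title, lemmas in lemmas_by_play.items():
        if title != target:
            others.update(lemmas)
    return sorted(set(lemmas_by_play[target]) - others)


# Toy data: each play is reduced to a short, made-up list of lemmas.
toy_folio = {
    "Timon of Athens": ["rioter", "usure", "gold", "friend"],
    "Hamlet": ["gold", "ghost"],
    "King Lear": ["friend", "storm"],
}
print(words_unique_to_play(toy_folio, "Timon of Athens"))
# prints ['rioter', 'usure']
```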

We can use this as a basis to compare with the other plays in the database. I ran a test on five of Middleton's plays from the period; of those 91 words, these are the ones that showed up in those plays:

The Phoenix (1603-4) - balsam
Michaelmas Term (1604) - rioter, towardly
A Trick to Catch the Old One (1605) - penurious, rioter, usure
A Mad World, My Masters - none
A Chaste Maid in Cheapside (1613) - 37 - suitable

"penurious" was pretty common in plays of the period as was "balsam". "towardly" was in Eastward Ho, so Middleton's use isn't notable. "usure" was relatively common; it's in Volpone and other plays.

Really, the only word of interest is "rioter". As explained above, that word appears in Act 3, Scene 5 of Timon of Athens, inside one of the gaps, and it was not a very common word in plays of the period. Middleton uses it in two plays of the period, or three if he actually wrote Yorkshire Tragedy.

Comparing Relatively Rare Words in Middleton's plays with Timon of Athens

Now I will do a mini "Middleton Affinity Test". I will select out all of the unusual words in the above five plays and see how often and where they occur in Timon of Athens. Some of these words occur in other Shakespeare plays. Here is the list:

6185 scarcity
6538 towardly
7121 disfurnish
8885 unnoted
9235 rioter
9579 usure
11539 spital
12502 quillet
14682 thievery
14691 attraction
15260 usure
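
A position list like the one above can be produced by walking through the play's lemmas in order and recording where any word of interest turns up. Here is a small Python sketch of that step. The name `locate_lemmas`, the choice to number positions from 1, and the toy lemma sequence are my assumptions for illustration only.

```python
def locate_lemmas(play_lemmas, target_lemmas, first_position=1):
    """Return (position, lemma) pairs for every occurrence of a target lemma,
    in the order the lemmas appear in the play."""
    targets = set(target_lemmas)
    return [
        (position, lemma)
        for position, lemma in enumerate(play_lemmas, start=first_position)
        if lemma in targets
    ]


# Toy data: a made-up lemma sequence standing in for the full play.
toy_play = ["the", "rioter", "take", "usure", "and", "the", "usure"]
for position, lemma in locate_lemmas(toy_play, {"rioter", "usure", "spital"}):
    print(position, lemma)
```

Repeated words such as "usure" show up once per occurrence, which is why it appears twice in the list.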

We looked at most of these words above, and the others are not particularly notable. "spital" is common in Shakespeare; "attraction" is in Merry Wives of Windsor and Pericles; "scarcity" is in Venus and Adonis; "quillet" is very common; "unnoted" is in Rape of Lucrece; "disfurnish" is in Two Gentlemen of Verona and Pericles.

If anything, this list suggests that Middleton borrowed some unusual words from Shakespeare. It certainly is not evidence of co-authorship.

Conclusion

Nothing here gives reason to conclude that Thomas Middleton wrote a substantial portion of Timon of Athens. There are some gaps in rare Shakespeare words, particularly in Act 3, Scenes 5 and 6, and that is where I would look for co-authorship. The attribution to Middleton is possible, but it seems largely unsupported by this data.

Introducing the Shakespeare Affinity Test

I developed the Shakespeare Affinity Test (SAT) out of a simple hypothetical. What if the First Folio contained only 35 plays, and a 36th play was discovered? Could we devise an objective test, based on the other 35 plays, that would tell us, with some confidence, whether it was written by William Shakespeare?

The SAT succeeds quite well in that goal. Out of 300+ plays written before 1620, it groups 27 of the 36 First Folio plays in the top 30 results. It also identifies Two Noble Kinsmen and Pericles -- two plays generally believed to be co-written by Shakespeare but not included in the First Folio -- as potentially written by Shakespeare.

I call it an "affinity" test rather than an "authorship" test because there is no known method for determining authorship with any degree of confidence. The Shakespeare canon also may be a unique case. His vocabulary was unusual, and the canon is unusually large. So, the test also may not have general applicability for determining authorship with other authors. Even with regard to Shakespeare's plays, it is only a guide to determining authorship. It is not a definitive test.

That said, it is an objective, easily modifiable, and extensible rare word test. I hope it will find many uses. I will cover some of those below and in future blog posts.

Design Criteria for Shakespeare Affinity Test (SAT) 

I had a few key criteria for my test:
1. Objective. The test would need to be as straightforward and unbiased as possible.
2. Reproducible. Anyone should be able to reproduce the test and get the same results. It should also be easy to modify the test to see if the results are due to cherry-picked parameters or other biasing factors.
3. Valid. There should be good reason to think that the test actually works.

Fortunately, independent researcher Pervez Rizvi has developed a database of the full text of 500+ English language plays from the Early Modern period. This database is a free, publicly available corpus. You can download it from his website. My test runs on this database completely unmodified. It runs using standard SQL code on the MySQL database, available for free. Anyone can run the same test and get the same results.

Note, I am not going to spend time and energy critiquing other authorship tests done in the field of Shakespeare scholarship. But all tests must meet the above three criteria. These are the most basic standards of good scholarship.

General Theory Behind a Rare Word Test

Let me give an example of the theoretical motivation behind the test. The word “cicatrice” (a scar) occurs in four plays of Shakespeare. However, the word only occurs in one other play in the database of 500+ plays. So "cicatrice" serves as a marker for a Shakespeare play.  If we found a previously unknown play from that period with the word “cicatrice” in it, that might be a hint that the play was written by Shakespeare.

The beauty of my test is that it has an objective standard for what is a "rare word." It uses the prevalence of words in the entire database to determine which ones are rare. The characteristics of that definition can easily be changed to narrow the query. For instance, you could look for rare words during a certain time period or look for unusual words that begin with the prefix "un-" or the suffix "-ate". I have done such tests -- with interesting results -- that I will be sharing in time.
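
As a rough illustration of narrowing a rare word list by prefix or suffix, here is a short Python sketch. The function `filter_by_affix` is hypothetical, and it relies on plain string matching, so it cannot tell a true "un-" prefix from a word that merely starts with those letters (for example "unctuous").

```python
def filter_by_affix(lemmas, prefix="", suffix=""):
    """Return the lemmas that start with `prefix` and end with `suffix`.
    An empty prefix or suffix matches everything."""
    return sorted(
        lemma
        for lemma in lemmas
        if lemma.startswith(prefix) and lemma.endswith(suffix)
    )


# A few lemmas used purely as sample input.
sample = ["unclew", "unaptness", "castigate", "opulency", "untirable"]
print(filter_by_affix(sample, prefix="un"))
# prints ['unaptness', 'unclew', 'untirable']
print(filter_by_affix(sample, suffix="ate"))
# prints ['castigate']
```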

Once again, this is an "affinity" test, not an "authorship" test. I am not claiming it produces a reliable determination of authorship. It gives useful and objective information about word usage in early modern English plays. That information can be helpful in determining authorship.

How the Test Works

Pervez Rizvi's database lists the “lemmatized” form for each word in a play. This is the dictionary headword form; so for instance “buy,” “bought,” and “buying” would all be listed as “buy.” This simplification makes comparison straightforward. After all, we want “apple” and “apples” to count as the same word. The test runs on the lemmatized forms of words.

I will now describe the test in an algorithmic form. This is the procedure used to produce the results.

1. Select the lemmas in the database that occur 20 times or fewer. (Note: this counts total occurrences, not the number of plays a lemma occurs in.) Let's call this list RARE_LEMMAS. These are words that objectively aren't very common in early modern English plays. (Note: 20 is an arbitrary number, and I encourage people to play with different parameters and compare results.)

2. Select the lemmas that occur in Shakespeare's First Folio plays. For Henry VIII, only use the Shakespeare section, not the Fletcher section. Pervez Rizvi's database has this division built-in. Let's call this list SHAKESPEARE_LEMMAS.

3. Select only the words in the RARE_LEMMAS list that also occur in the SHAKESPEARE_LEMMAS list. This gives you RARE_SHAKESPEARE_LEMMAS.

4. Count how many of the RARE_SHAKESPEARE_LEMMAS occur in each play. Count each lemma only once. This gives you the RAW_SCORE.

5. For First Folio plays, you run the same test, but first create a special SHAKESPEARE_LEMMAS list from only the other 35 First Folio plays. Using that, you create a RARE_SHAKESPEARE_LEMMAS list to compare with the play and generate a RAW_SCORE.
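
The five steps above can be sketched in Python on a toy corpus. This is not the SQL that produced the results; it is a minimal re-statement of the procedure under my own assumptions: the corpus is a dictionary from play title to a list of lemmas, the function names are hypothetical, and the rarity threshold is lowered to 3 so the toy data has something to find.

```python
from collections import Counter


def rare_lemmas(corpus, max_occurrences=20):
    """Step 1: lemmas whose total occurrences across the corpus are at or
    below the threshold (occurrences, not number of plays)."""
    counts = Counter()
    for lemmas in corpus.values():
        counts.update(lemmas)
    return {lemma for lemma, n in counts.items() if n <= max_occurrences}


def raw_score(corpus, shakespeare_titles, title, max_occurrences=20):
    """Steps 2-5: count the distinct rare Shakespeare lemmas in `title`.
    If `title` is itself a Shakespeare play, it is left out of the
    reference set, as in step 5."""
    shakespeare_lemmas = set()
    for other in shakespeare_titles:
        if other != title:
            shakespeare_lemmas.update(corpus[other])  # step 2
    rare_shakespeare = rare_lemmas(corpus, max_occurrences) & shakespeare_lemmas  # step 3
    return len(rare_shakespeare & set(corpus[title]))  # step 4, each lemma once


# Toy corpus: made-up lemma lists, two "Shakespeare" plays and two others.
toy_corpus = {
    "Play A": ["love", "cicatrice", "love", "king"],
    "Play B": ["cicatrice", "king", "spilth", "love"],
    "Play C": ["love", "king", "love", "cicatrice"],
    "Play D": ["love", "king", "pelf"],
}
toy_shakespeare = ["Play A", "Play B"]

for play in toy_corpus:
    print(play, raw_score(toy_corpus, toy_shakespeare, play, max_occurrences=3))
```

In this toy run, "cicatrice" is rare and appears in the reference plays, so any play containing it scores 1, while "Play D" scores 0. Its only rare word, "pelf", never appears in the reference set.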

The RAW_SCORE isn't scaled in any way for the length of the play. The number is useful but it provides a somewhat skewed result. To scale it, you divide the RAW_SCORE by the total number of unique lemmas in each play. This number, NUM_TOKENS, is included in the database for each play. This produces a SCALED_SCORE. Multiply this number by 1000 to get the FINAL_SCORE.
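
The scaling step is simple arithmetic: FINAL_SCORE = 1000 x RAW_SCORE / NUM_TOKENS. Here is a small Python sketch of it. The function name `final_score` is hypothetical, and the numbers in the example are made up for illustration; they are not taken from the database.

```python
def final_score(raw_score, num_tokens):
    """Scale a RAW_SCORE by the play's NUM_TOKENS value and multiply by 1000."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # SCALED_SCORE is raw_score / num_tokens; FINAL_SCORE is 1000 times that.
    return 1000 * raw_score / num_tokens


# Hypothetical inputs: a play with 120 hits and a NUM_TOKENS value of 3000.
print(final_score(raw_score=120, num_tokens=3000))
# prints 40.0
```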

Once again, I encourage people to modify the parameters as they see fit. But the simple fact is, the test works. It can identify plays written by Shakespeare, especially later plays, reasonably well. It works less well for very early plays, and I will discuss this issue in a later blog post.

Here are the rough results, out of approximately 300 plays written before 1620. Shakespeare's plays are very strongly grouped at the top of the list:



So, for instance, The Tempest scores 39.69 while Jonson's The Alchemist scores only 27.25. Hamlet scores highest with 67.70. Interestingly, Beaumont and Fletcher's A King and No King scores extremely low with 12.07. Over time I will release more detailed data along with the SQL code. This blog post is just designed to give people a general idea of how the test works.

Other Uses for the Test

The test has many uses. It can be used to help divide up plays based on co-authorship or determine whether a play is co-authored. It also can help identify when specific words enter the Shakespeare canon and how unusual they were at the time. The high scores for What You Will and The Devil's Charter should spur further research, and I will have more to say on this later.