Analysis of Shakespeare’s work using Python 3

Image created by Mr. Chandra Sekhar Kounduri for fermibot.wordpress.com 


After learning about Zipf’s law, I wanted to find those patterns myself. I am learning Python 3 at the moment and was looking at examples of how to read text files and the individual lines in them. I found the complete works of Shakespeare on this page: http://shakespeare.mit.edu/

More information on Zipf’s law is available here:

After copying the text from the web pages into text files, I imported these files into the Python environment and analysed them. The analysis of the complete works led to the following observations. Presented below is the output of the Python console. I used Anaconda 4.2.0 with Python 3.5.

The goal is to count the number of occurrences of each letter of the English alphabet.
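A minimal sketch of how such a letter count could be done is shown below. It assumes the plays have been saved as `.txt` files in a single folder. The function names `count_letters` and `count_letters_in_folder` and the folder layout are my own choices for illustration, not necessarily what the original script does.

```python
from collections import Counter
from pathlib import Path
import string


def count_letters(text):
    """Count occurrences of the 26 English letters, ignoring case."""
    counts = Counter()
    for ch in text.lower():
        # Skip digits, punctuation, whitespace and anything outside a-z.
        if ch in string.ascii_lowercase:
            counts[ch] += 1
    return counts


def count_letters_in_folder(folder):
    """Add up letter counts over every .txt file in a folder (assumed layout)."""
    total = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total.update(count_letters(text))
    return total


if __name__ == "__main__":
    sample = "To be, or not to be, that is the question"
    # Print the five most frequent letters in the sample sentence.
    for letter, count in count_letters(sample).most_common(5):
        print(letter, count)
```

Calling `count_letters_in_folder` on the folder holding the text files and then `.most_common()` on the result would give a letter/count listing sorted from most to least frequent.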

[Image: william_shakespere_comedy_analysis_cumulative]


A table is attached below for reference.

Letter Letter count
e 424718
t 300213
o 287878
a 250548
h 227474
s 224207
n 222907
r 216036
i 206018
l 151036
d 139409
u 117145
m 98215
y 87237
w 76331
f 70909
c 69184
g 59685
p 48470
b 48311
v 35293
k 30909
x 4671
q 2862
j 2829
z 1350

Graph of the data, made in Microsoft Excel:

We can see that, once the letters are sorted by count, the counts fall off in a way that resembles an exponentially decaying distribution.

[Image: william_shakespere_comedy_analysis_cumulative_graph]
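One rough way to check the "exponential" impression is to fit a straight line to log(count) against rank: if the drop is exponential, the points should lie close to a line. For comparison, a Zipf-style power law would instead make log(count) linear in log(rank). The sketch below tries both on the letter counts from the table above. The helper `fit_line` is an illustrative name of mine, and this is only a quick sanity check, not a rigorous statistical test.

```python
import math

# Letter counts copied from the table above, sorted from most to least frequent.
LETTER_COUNTS = [
    424718, 300213, 287878, 250548, 227474, 224207, 222907, 216036, 206018,
    151036, 139409, 117145, 98215, 87237, 76331, 70909, 69184, 59685,
    48470, 48311, 35293, 30909, 4671, 2862, 2829, 1350,
]


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


ranks = list(range(1, len(LETTER_COUNTS) + 1))
log_counts = [math.log(c) for c in LETTER_COUNTS]

# Exponential model: log(count) is linear in rank.
exp_fit = fit_line(ranks, log_counts)
# Zipf-style power law: log(count) is linear in log(rank).
zipf_fit = fit_line([math.log(r) for r in ranks], log_counts)

print("exponential fit: slope=%.3f, R^2=%.3f" % (exp_fit[0], exp_fit[2]))
print("power-law fit:   slope=%.3f, R^2=%.3f" % (zipf_fit[0], zipf_fit[2]))
```

The fit whose R^2 is closer to 1 describes the data better, though with only 26 points this should be read as a rough indication.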

Analysis of Words

A similar analysis was done on the words, and it gave some interesting results. The procedure was as follows: an empty dictionary was created, and all the text files were read one after another, adding each new word to the dictionary and updating the counts of words already in it. The plot of the top 50 words is given below.

[Image: words_plot]

We can see that the most used words are ‘the’, ‘and’, ‘i’, etc. Some archaic words were also used often, for example ‘thee’, ‘thou’ and ‘thy’. Please note that there are more than 30,000 unique words across all his works; the plot above shows only the 50 most used.
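The dictionary-based procedure described above can be sketched roughly as follows. The function names, the folder layout, and the rule for what counts as a word (runs of the letters a-z, with straight apostrophes allowed inside) are assumptions made for this illustration. A different tokenising rule would change the totals, so this will not necessarily reproduce the figures quoted here.

```python
import re
from pathlib import Path

# Assumed word rule: lowercase letters, optionally joined by straight apostrophes.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def update_word_counts(counts, text):
    """Add the words found in `text` to the running `counts` dictionary."""
    for word in WORD_PATTERN.findall(text.lower()):
        # New words start at 0; known words are incremented.
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_words_in_folder(folder):
    """Read every .txt file in a folder in turn, updating one shared dictionary."""
    counts = {}  # start from an empty dictionary
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        update_word_counts(counts, text)
    return counts


def top_words(counts, n=50):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    sample = "Thou art the thing itself; the thing I love, and thou know'st it."
    counts = update_word_counts({}, sample)
    for word, count in top_words(counts, 5):
        print(word, count)
```

`len(counts)` gives the number of unique words, and the output of `top_words(counts, 50)` is the kind of data one would feed into a bar plot of the top 50 words.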

If you have any questions, feel free to leave a comment.


Code: 

The code used for the letter analysis is available here: Code

The code used for the word analysis is available here: Code

A zip file with all the works of William Shakespeare in .txt format can be found HERE.


End of the post 🙂