A look into holy books and their words. This was my final project and is a continuation of my work with SBBW. It is also a work in progress.
It would be an understatement to call the Qur’an, Tanakh, and New Testament anything short of some of the most influential and most read texts on earth. But despite the texts’ highly studied nature, there still exist mysteries within them that have gone unstudied. This is not the fault of those who study these texts, or of the time spent engaged with them, but rather an issue of technology. We now possess tools capable of moving past the format of reading and study, and into a realm where words are processed in a manner that can show us more than traditional study alone could. This is an exercise in our understanding of text as a form, and in new versus old methods of examining text. Our understanding of text has long been too linear and bound by the constraints of reading. Time spent reading, the difficulties involved in traditional analysis, the background needed to understand even a single text, and the cultural understanding of the works at hand are all issues that emerge within traditional inquiry. By utilizing new methods of examination we can not only delve deeper in our inquiries but also do so with greater ease. The way in which we currently analyze texts is an inefficient process. In utilizing these new methods we can better understand texts and, in some cases, reveal new information that previous methods failed to uncover.
By using texts that are not only commonly available but also commonly understood, we can better evaluate the methods used. This is the reasoning behind choosing the three Abrahamic texts. I do not mean to insinuate any superiority or theological hierarchy, simply that these texts are full of data and already familiar to most people.
The methods employed consist primarily of using the open source programming language ‘R’ to run various text mining scripts I have compiled and written in order to open these texts up. Surprisingly, the majority of this project wasn’t actually spent delving into the texts, but rather in learning the methods and developing the tools required for analysis. Once all of the necessary tools had been developed, the actual analysis was as simple as running the code and sorting through the results. When all was said and done, months of development amounted to just a few minutes of processing.
The scripts began by loading all three texts in plain text format (.txt). The documents were then trimmed of all ‘stop words’, words that are considered to have little to no analytic value (a, and, the, also, etc.). Next, punctuation, numbers, and other characters that might further skew the data were removed. In certain cases documents contain specific words that are not part of the document’s content, for example chapter headers or publishing information; these were dealt with on a case by case basis. The documents were then converted into a document term matrix, a table that records how often each term occurs in each document. Once the staging of the data was complete, analysis was able to commence.
Initially the methods used were simple queries into the individual documents. Most frequent terms, least frequent terms, and visualizations of these were produced to give a rough idea of each text’s content. Next, more complex analysis was done to give a nuanced look at how the texts were interconnected, not only topically but also through shared terms and semantic comparisons. The three tests performed were Hierarchical Cluster Analysis (HCA), Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA).
The primary purpose of each of these tests was to find patterns that are present within the texts but not visible, or even possible to see, in a standard reading. One thing that should be noted is that, rather than being performed on the three Abrahamic texts in whole as three separate files, these analyses were performed on the books of Matthew, Mark, John, and Luke from the New Testament, the books of Joshua, Judges, Samuel, and Kings from the Nevi’im within the Tanakh, and the Qur’an in full. This truncated sample was chosen for a few reasons. Primarily, processing every separate document from each text, i.e. each individual book, would have taken a great deal of time as well as processing power. For the sake of completing this paper, the sample was cut down to four books from each text’s section pertaining to its prophets. The exception was the Qur’an, as it does not follow the same chapter or writing conventions as the other two texts. Where the New Testament and the Tanakh are each a compilation of books by different authors, the Qur’an is a single text without authorial breaks. It is because of this that some of the clustering may seem to plot the Qur’an as an outlier; methods to improve this part of the analysis are still being worked on.
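To make the preprocessing steps described above concrete, here is a minimal sketch in Python. It is not the R code used for this project: the tiny stop word list, the two sample sentences, and the function names are all invented for illustration. It lowercases the text, drops punctuation and numbers, removes stop words, and builds a small document term matrix.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real run would use a much longer one.
STOP_WORDS = {"a", "and", "the", "also", "of", "to", "in", "is"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, keep only runs of letters (drops punctuation and digits), drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_words]

def document_term_matrix(docs):
    """Return (vocabulary, rows): one row of term counts per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {name: [c[t] for t in vocab] for name, c in counts.items()}
    return vocab, rows

# Invented sample sentences, not quotations from the texts studied here.
docs = {
    "doc_a": "The lord said unto the people: 'Go, and take the city.'",
    "doc_b": "And the people went up to the city in 40 days.",
}
vocab, rows = document_term_matrix(docs)

# Overall term frequencies across both documents.
totals = Counter()
for text in docs.values():
    totals.update(tokenize(text))

print(vocab)
print(rows)
print(totals.most_common(3))
```

Each row of the matrix is one document and each column one term, which is the shape the clustering and topic modeling steps below expect. Note that this simple tokenizer would split a word like "don't" into two pieces; a fuller pipeline would handle apostrophes separately.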
Hierarchical Cluster Analysis
The first analysis of the texts I performed was HCA. It focuses on creating clusters of data based on distances within a given data matrix. Plotted in fig. 6 is a dendrogram of this data. The simplest way to interpret the chart is to view it as a visualization of distance and similarity: the further away one document is from another, the more dissimilar they are; conversely, the most closely linked documents are the most similar. Similar in what way, one might ask? This particular chart focuses on the similarity of language across documents.
The data seen within this chart is unfortunately bland, but it confirms basic assumptions. The three groupings that stand out are, from left to right, the three texts from which the sub-texts were pulled. These are marked by the delineating boxes in red. The Qur’an is furthest left, separate from the other texts; in the middle we see the books of the Nevi’im; and furthest right are the books of the New Testament. This may seem fairly obvious, but the goal of this test was to reveal information that may be hidden within texts, and though it is easy enough to assume that texts from the same book have similar word use, it does no good to assume anything when testing a hypothesis. Thus we are left with data that confirms what we already knew. This isn’t necessarily a bad thing. Although the larger view of the data gives us predictable results, if we look closer at the lower levels of the graph we can see connections between the individual books within the texts. Starting with the New Testament, we can see similarity between Luke and Matthew, and between John and Mark. The Nevi’im has a few more books plotted. This is because of the split nature of the book of Samuel and the book of Kings, each of which has two parts. What is interesting here is that neither part is most closely linked to its corresponding piece, i.e. Samuel 1 is not as closely linked to Samuel 2 as one would expect. In fact, Kings 1 and Samuel 2, and Samuel 1 and Kings 2, are closer to each other than they are to their own counterparts, despite being from the same book. These methods can serve to confirm theories about texts that may be visibly obvious, but they can also reveal new information about texts that could not be seen even by the most versed scholar. It is this element that can help us better piece together the differences between texts.
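For readers who want to see how a dendrogram like this is built, the following Python sketch performs a simple agglomerative clustering. It is an illustration under assumptions, not the method behind fig. 6: I use cosine distance and average linkage here, the term counts are hypothetical, and the distance measure and linkage used in my R scripts may differ. At each step the two closest clusters are merged, and the order of merges is what a dendrogram draws.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average_linkage(vectors):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over every pair of documents across the two clusters.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], round(d, 3)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical term counts for four documents over a five-term vocabulary.
vectors = {
    "doc_a": [9, 1, 0, 4, 0],
    "doc_b": [8, 2, 0, 5, 0],
    "doc_c": [0, 7, 6, 1, 2],
    "doc_d": [1, 6, 7, 0, 3],
}
merges = average_linkage(vectors)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

With these made-up counts the two documents that share vocabulary merge first, and the two pairs only join at the very end, which is the same kind of picture the red boxes mark out in the dendrogram.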
Latent Dirichlet Allocation
The purpose of LDA is to link texts not by individual words or by document distance but by topics. These topics are found using an unsupervised machine learning algorithm. Author Matthew Jockers explains the process, sans math, in a rather amusing way. The setting involves Herman Melville and Ernest Hemingway attempting to determine what topics appear on the menu of the LDA Buffet, a restaurant that only serves leitmotifs. His explanation is as follows:
“Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as back ups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses, and asks himself two questions:
- “How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
- “Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .
As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls…Hemingway must now run through this whole process over and over again many times. Ultimately, his topic bowls reach a steady state where words are no longer needing to be being reassigned to other bowls”
As humorous as this explanation is, it serves as a phenomenal explanation of both how LDA is processed by a computer and why it takes so long. For these particular documents the process took approximately fifteen minutes to run through 2000 iterations. This was only 11 documents, with approximately 100,000 unique words before trimming. The topics LDA produced, seen in fig. 8, are unlabeled, but we can loosely intuit what each one seems to represent. Topic one, in column one, is primarily concerned with Jesus and the Trinity. This is further shown in the probability chart in fig. 9, where LDA assigns ‘jesus’ a probability of 0.81. Associated terms in topic one also include father, disciple, god, and answer. If one were to label this column it would most likely be ‘New Testament: People’, which is further affirmed when we look at column two. Column two is a bit less obvious as to its topic. It does feature words concerning actors, but words like kingdom, house, heaven, left, enter, and time also appear. From this we could decide that the focus here is on place, time, and spatial words, and label it ‘New Testament: Places’. The third column stands apart from the rest, as it has no paired column. Its words are concerned with the Qur’an; as discussed previously, the format of the Qur’an makes it difficult to match its data up with the other Abrahamic texts. Nevertheless we can compare topic words: Allah, thou, merciful, and believe all have prominence. The final two columns are both associated with the Old Testament. In similar fashion to the two New Testament columns, the fourth column refers to actors and the fifth to place and time. One can label them ‘Old Testament: People’ and ‘Old Testament: Places’ respectively. It may again seem obvious that the data appears as it does.
However, the purpose of this type of mining is both to expose information that may be hidden in the text and to hasten the process of collecting it. Imagine for a moment that you had never read these texts: reading through all of these documents would be a daunting and far too inefficient way to gain the same amount of information. Instead, we can use techniques such as LDA to quickly model the texts’ pertinent information for easy digestion.
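The buffet story quoted above maps fairly directly onto a technique called collapsed Gibbs sampling. The Python sketch below is a toy version of that idea, offered as an assumption about how such a sampler can look rather than as the code or package behind fig. 8. The four miniature "documents", the function name, and the values of alpha and beta are all invented for illustration.

```python
import random

def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topic counts per document, word counts per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random assignment of every word to a topic "bowl".
    assign = []
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word out of its current bowl.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weigh each topic by the two questions in the quotation:
                # how much of the topic is in this document, and how much
                # of this word the topic already holds.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Put the word into the bowl that was drawn.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Invented miniature documents built from topic words mentioned above.
docs = [
    "jesus disciple father answer jesus disciple".split(),
    "jesus father disciple jesus answer father".split(),
    "allah merciful believe allah merciful believe".split(),
    "allah believe merciful allah believe allah".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(topic_word):
    top = sorted(words, key=words.get, reverse=True)[:3]
    print("topic", k, top)
print(doc_topic)
```

The nested loops make the cost plain: every word in every document is revisited on every iteration, which is why even a small corpus takes a while at 2000 iterations.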
Individual Analysis: Qur’an
After looking at all the texts together as a single dataset, we can now move on to individual analysis to better understand the connections previously made. The analysis of the individual texts is fairly straightforward. The texts were processed in the same manner as described above. The data collected concerned which terms appear most frequently; from this a bar graph and a word cloud were plotted. One point of note on these two methods of visualization is that they fail to show negation, i.e. they might show that the word ‘sin’ appears 100 times, leading one to think this was an endorsement, without showing that the words ‘do not’ appear before 75 of those 100 instances. This isn’t to say that these charts do not serve well in providing an at-a-glance look at their data.
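The negation problem can be checked with a small amount of code. The Python sketch below is a hypothetical helper, not part of the analysis behind the figures: it counts a term and also counts how many of its occurrences have a negating word within the two tokens before it. The sample sentences, the list of negators, and the window size are assumptions chosen for illustration.

```python
import re

def count_with_negation(text, term, negators=("not", "no", "never"), window=2):
    """Count `term` and how many of its occurrences closely follow a negator."""
    # Work on raw tokens: a stop word list would usually have removed "not".
    tokens = re.findall(r"[a-z]+", text.lower())
    total = negated = 0
    for i, tok in enumerate(tokens):
        if tok == term:
            total += 1
            if any(t in negators for t in tokens[max(0, i - window):i]):
                negated += 1
    return total, negated

# Invented sample text, not a quotation from any of the three texts.
sample = "Do not sin. Thou shalt not sin again. He fell into sin. Never sin."
total, negated = count_with_negation(sample, "sin")
print(total, negated)
```

A frequency chart would report only the first number. The window here ignores sentence boundaries, so this is a rough check rather than a real treatment of negation.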
Fig. 3 and Fig. 4 were both constructed by finding the words that appear more than 300 times and plotting the top 20 of these terms. Allah, lord, surely, will, and say appear most frequently within this text. Had we never seen this text before, we could make the accurate assumption that this book pertains primarily to what Allah says. We could even go deeper into these charts and see that good, earth, people, and thou are the next most prominent words plotted, giving us a prediction of the text’s topics. Interestingly enough, the tradition of the Qur’an is that it is the direct words of Allah as they were said to Muhammad; it would seem that our analysis of the text speaks to this fairly accurately. This is already known, and yet again the data confirms it.
Moving on to the Tanakh, it should be noted that due to the greater length of this text, the cutoff for a term to be plotted is 1000 or more occurrences (fig. 4-5). From the plots we see that lord, shall, will, unto, and thou are most frequently mentioned. Again postulating what this text might be about, we can see that it seems to have a more observational tone and to be focused on actions. Where the Qur’an ‘says’, the Tanakh’s assumed primary actor, ‘lord’, ‘will’ do or does. When comparing the Tanakh and the New Testament, the language is fairly similar. Most variation between the two comes from the primary actor: where the Tanakh’s is ‘lord’, the New Testament primarily features ‘God’. This is also apparent in the LDA and HCA results.
The final text to look at is the New Testament. The peculiar thing to notice is the prevalent use of the word ‘unto’. This is a great example of the necessity of human interpretation of data. What is important to consider is the value of words as it pertains to one’s analysis. Typically we are concerned with words that provide useful information to us as readers. A crucial step in preprocessing is trimming and stemming the data, which gives us an opportunity to clear the document of words that have little value to content analysis; words like the, is, a, etc. tell us very little about the document. When it comes to the New Testament, the challenge we are met with is the prospective value of the word ‘unto’ (fig. 1-2) to our analysis. It is not as if this word has no value, as it does communicate certain ideas about how the text is written. But in normal text analysis prepositions are rarely kept, because they often provide very little. However, because the list I used to trim these documents has its roots in modern English rather than biblical English, words like unto, thee, thou, etc. made their way into these charts. Two different actions could have been taken: remove these words, or keep them and postulate textual theories based on their use. I chose the latter, as I am confident that these passive words are often more important than people give them credit for. So what can be gleaned from ‘unto’? One could notice that, unlike the other documents, the New Testament has a high incidence of “saying ___ unto”. This is also found in the Tanakh, but the differing factor is to whom it is being said. Within the New Testament Jesus is the primary receiver of ‘unto’, whereas in the Tanakh the receiver is people, man, and men. The Qur’an in this case seems to have more similarities to the Tanakh, with higher use of thee and thou.
This particular dive into the word ‘unto’ would serve as an excellent point of inquiry for further research into the value of words within a historical context.
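One way such a follow-up inquiry might start is by tallying which word receives each ‘unto’. The Python sketch below is a hypothetical starting point, not the procedure used for the observations above: the function name, the short list of words to skip, and the sample sentences are all invented for illustration.

```python
import re
from collections import Counter

def words_after(text, target, skip=("the", "a", "all", "his", "her", "their")):
    """Count the first non-skipped word that follows each use of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    following = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Walk forward past articles and possessives to the receiver.
        for nxt in tokens[i + 1:]:
            if nxt not in skip:
                following[nxt] += 1
                break
    return following

# Invented sample text in a biblical register, not a quotation.
sample = ("And he said unto the people, Hear me. "
          "They came unto Jesus and said unto him, Master. "
          "And the word came unto the people.")
receivers = words_after(sample, "unto")
print(receivers.most_common())
```

Run over each text separately, a tally like this would give a first, rough comparison of who is being addressed; pronouns such as ‘him’ would still need a human reader to resolve.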
The way in which we are able to communicate and process words is constantly evolving. We now exist at a point of division. Now more than ever we are questioning words and texts: the truth of a word, the credibility of who said it, and how it connects to other texts and speeches. We also exist in a time where the ability to test and analyze said words and works has never been easier or more advanced (in addition to being widely and freely available). So what should we do, given the availability of words in need of examining and tools on the table ready for use? It should seem obvious. This inquiry serves to demonstrate that even with texts we are familiar with, we can still uncover new information or confirm our assumptions. There is a disconcerting trend of a lack of reading in recent times that seems to stem from the level of concise processing that has gone into most of the media we consume. That is to say, we no longer engage in the full reading and long form of analysis that we previously did. This is not to critique the value of condensing information, but rather to say that we should not leave the condensation to others; we should perform it ourselves. There is value in processing information independently and ensuring that the information we use to inform our decisions is as uncontaminated as possible. The cleanliness of words may very well be the key issue that we must tackle as a collective group. It would stand to reason that the most time-effective way to do this would be through the use of the powerful digital tools at our disposal. By doing so we can aim to create a more truthful and objective state of affairs with the assistance of machines.
Awati, Kailash. “A gentle introduction to topic modeling using R.” Eight to Late. Accessed March 16, 2017. https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/.
Awati, Kailash. “A gentle introduction to cluster analysis using R.” Eight to Late. Accessed March 16, 2017. https://eight2late.wordpress.com/2015/07/22/a-gentle-introduction-to-cluster-analysis-using-r/.
Awati, Kailash. “A gentle introduction to text mining using R.” Eight to Late. Accessed March 16, 2017. https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/.
N/A. “Basic Text Mining in R.” Basic Text Mining in R. Accessed March 16, 2017. https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html.
N/A. “Beautiful dendrogram visualizations in R: 5 must known methods – Unsupervised Machine Learning.” STHDA. Accessed March 16, 2017. http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning.
Gaikwad, Sonali, Archana Chaugule, and Pramod Patil. “Text Mining Methods and Techniques.” Text Mining Methods and Techniques. January 2014. Accessed March 16, 2017. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.428.8805&rep=rep1&type=pdf.
N/A. “Hierarchical cluster analysis.” Hierarchical cluster analysis. Accessed March 16, 2017. http://22.214.171.124/~michael/stanford/maeb7.pdf.
Jockers, Matthew. “The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors.” Matthew L. Jockers. September 29, 2011. Accessed March 16, 2017. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/.
Kassambara, Alboukadel. “Cluster Analysis in R – Unsupervised machine learning.” STHDA. Accessed March 16, 2017. http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning#at_pco=smlwn-1.0&at_si=58be4070490d48c1&at_ab=per-2&at_pos=0&at_tot=1.
Stewart, Brandon M. “Practical Skills for Document Clustering in R.” Practical Skills for Document Clustering in R. June 15, 2010. Accessed March 16, 2017. https://faculty.washington.edu/jwilker/tft/Stewart.LabHandout.pdf.