Studying the Difference Between Natural and Programming Language Corpora
Casey Casalnuovo
Department of Computer Science, University of California, Davis, CA, USA
E-mail: ccasal@ucdavis.edu

Kenji Sagae
Department of Linguistics, University of California, Davis, CA, USA
E-mail: sagae@ucdavis.edu
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do (Knuth, 1984).
Software development is usually a team effort; code that cannot be understood and maintained is not likely to endure. It is well known that most development time is spent in maintenance rather than de novo coding (Lehman, 1980). Thus it is very reasonable to consider source code as a form of human communication, which, like natural languages, encodes information as sequences of symbols, and is amenable to the sorts of statistical language models (LM) developed for natural language. This hypothesis was originally conceived by Hindle et al. (Hindle et al, 2012), who showed that LM designed for natural language were actually more effective for code than in their original context. Hindle et al. used basic ngram language models to capture repetition in code; subsequent, more advanced models, tuned for modular structure (Tu et al, 2014; Hellendoorn and Devanbu, 2017), and deep learning approaches such as LSTMs (Hochreiter and Schmidhuber, 1997) (with implementations such as (White et al, 2015; Khanh Dam et al, 2016)) yield even better results. Fig 1 demonstrates this difference on corpora of Java and English, using the standard entropy measure (Manning and Schütze, 1999) over a held-out test set. A lower entropy value indicates that a token was less surprising for the language model. These box plots display the entropy for each token in the test set, and show that (regardless of model) Java is more predictable than English1.
2. How much does programming language syntax influence repetitiveness in coding?
3. What are the contingent factors (not constrained by syntax) that play a role in code repetitiveness?
1 Precise details on the datasets and language models will be presented later in their respective sections.
3. Is repetitiveness observed in code also observed in other natural language corpora that similarly required significant effort from the creators?
We address this question with corpora of text that are similarly "effortful" for the writers (or readers, or both) or that have potentially higher costs of miscommunication: we consider English as a second language, and specialized corpora such as legal or technical writing.
– Our findings on syntax have practical consequences: they help significantly improve code suggestion on open category tokens in Java, which are harder for language models to predict but useful for programmers.
These suggest that differences observed between natural and programming languages are not entirely due to grammatical limitations, and that code is also more repetitive due to contingent facts – i.e., humans choose to write code more repetitively than English. Our experiments with bodies of text (other than code) that require greater effort indicate that people
A language's syntax constrains the set of valid utterances. The more restrictive the grammar, the less choice in utterances. Thus, it is possible that the entropy differences between code and NL arise entirely out of the more restrictive grammar for code. If so, the observed differences are not a result of conscious choice by humans to write code more repetitively; it is simply the grammar.
However, if we could explicitly account for the syntactic differences between English and code, and still find that code is repetitive, then the unexplained difference could well arise from deliberate choices made by programmers. Below, we explore a few theories of why the syntax of source code may be more repetitive than the syntax of natural language.
In code, reserved words, like for, if, or public, form a closed set of language-specific keywords which help organize syntax. The arithmetic and logical operators (which combine elements in code like conjunctions in English) also constitute a closed vocabulary. Code also has punctuation, like ";", which demarcates sequences of expressions, statements, etc. These
2 While this category is very rarely updated, there could be unusual and significant changes in the language – for instance, a new preposition or conjunction in English.
2.2 Ambiguity in Language
Programming language grammars are intentionally unambiguous, whereas natural languages are rife with grammatical ambiguity. Compilers must be able to easily parse source code; syntactic ambiguity in code also impedes reading and debugging. For example, in the C language, there are constructs that produce undefined behavior (see Hathhorn et al (Hathhorn et al, 2015)). Different compilers might adopt different semantics, thus vitiating portability.
They saw the building with a telescope.
There are two meanings, depending on where the phrase with a telescope attaches: did they see using the telescope, or is the telescope mounted on the building? Both meanings are valid; one or the other may be preferred based on the context.
Fig. 2 Two parse trees for the sentence They saw the building with a telescope. The tree on the left corresponds to the reading that the telescope is part of the building; on the right, to the reading that the viewing was done with a telescope
A CPT fully resolves syntactic ambiguities: e.g., consider Fig. 2, which shows the two possible CPTs for our example sentence. While the raw text is ambiguous, each of the CPTs fully resolves and clarifies one of the different possible meanings; only one meaning is possible for a given CPT. In source code, however, the syntactic structure is unambiguous, given the raw tokens.
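To make this concrete, here is a minimal illustrative sketch (not from the paper's tooling): two bracketed CPTs for the example sentence are read into tree structures. The trees yield the same surface tokens, yet each one encodes exactly one attachment for the prepositional phrase.

```python
# A minimal sketch: each bracketed CPT pins down exactly one reading.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively read one (LABEL child...) node from a token list."""
    assert tokens.pop(0) == "(", "expected '('"
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))   # nested constituent
        else:
            children.append(tokens.pop(0))   # leaf word
    tokens.pop(0)  # drop ')'
    return (label, children)

def leaves(node):
    """Collect the surface words of a tree, left to right."""
    _, children = node
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

# Reading 1: the PP attaches to the NP (the telescope is on the building).
np_attach = "(S (NP they) (VP (V saw) (NP (NP the building) (PP with a telescope))))"
# Reading 2: the PP attaches to the VP (the viewing used the telescope).
vp_attach = "(S (NP they) (VP (V saw) (NP the building) (PP with a telescope)))"

t1 = parse(tokenize(np_attach))
t2 = parse(tokenize(vp_attach))
print(leaves(t1) == leaves(t2))  # True: identical surface string
print(t1 != t2)                  # True: distinct syntactic structures
```

The two trees are string-identical at the leaves but structurally distinct, which is exactly why a CPT removes the ambiguity of the raw text.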
RQ2. When parse trees are explicitly included for English and Java, to what degree are the differences in predictability accounted for?
2.3 Explanations From Contingent Factors
3 Indeed, it is not clear how to even design such an experiment.
Attaining fluency in a second language is difficult. If humans manage greater language difficulty by deploying more repetitive and templated phrasing, then we might find evidence for this in English as a Foreign Language (EFL) corpora.
Use of templated and repetitive language appears in linguistic research through the concept of formulaic sequences (Schmitt and Carter, 2004). These are word sequences that appear to be stored and pulled from memory as a complete unit, rather than being constructed from the grammar. Such sequences come in many forms, one of the most common being idioms, but the key point is that they are intended to convey information in a quick and easy manner (Schmitt and Carter, 2004). This theory is backed by empirical evidence, as both native and non-native readers have been found to read such phrases faster than non-formulaic language constructs (Conklin and Schmitt, 2008). Several studies have found that language learners acquire and use these sequences as a shorthand to express themselves more easily, and thus use them more heavily than native speakers (Schmitt and Carter, 2004; De Cock, 2000; Paquot and Granger, 2012). We can see such use as an adaptation to novices' increased difficulty with the language. If we can statistically capture the patterns in written corpora of language learners and see trends similar to those in source code, it would be consistent with the hypothesis that source code is more repetitive because it is more cognitively difficult. Therefore we ask the following questions:
While differences between general and technical language use have long been a focus of linguists (Gotti, 2011), attempts to categorize the differences between the two (Gotti, 2011) run into somewhat contradictory forces. Gotti cites Hoffmann, who gives 11 properties desirable in technical language, including unambiguousness, objectivity, brevity, simplicity, consistency, density of information, etc. The desire for a lack of ambiguity conflicts with the desire for a concise and informative text, as the meaning is also intended to be clear (Hoffmann, 1984; Gotti, 2011). Moreover, technical language is also heavily shaped by the intended audience, ranging along a spectrum from communication to laypeople (either for educational or general public dissemination) to communication between experts, which often includes highly unambiguous mathematical formulations (Gotti, 2011). Expert-to-expert communication is characterized by usage of unexplained terminology, or jargon, which can be efficient (Varantola, 1986; Gotti, 2011). Moreover, technical language is marked by compound noun phrases, which may be easier for language models to detect. Salager et al. found that, compared to the 0.87% rate of compounds in general English, technical language had them appear at a rate of 15.37% (Salager, 1983). Once learned, these instances of jargon and compound phrases may act to reduce cognitive load for experts who recognize them, allowing easier reference to complex ideas.
Additionally, longer sentences are associated with technical language, especially legal language, with increased length sometimes suggested as arising from a need for greater precision (Gotti, 2011). However, this claim of precision in legal language is disputed; as Danet points out, for being supposedly precise, laws often require extensive interpretation (Danet, 1980). Though there is evidence of political gamesmanship making the language overly verbose and complex, legal language and technical language in general are still driven in part by the need for precision and reduced ambiguity. Such language can be seen as more difficult or labored than general language, and we would expect it to feature more code-like properties.
2.4 Measuring Repetition in Language
When studying repetition and comparing our programming and natural language corpora, we apply two general techniques – language modeling and Zipf frequency plots. This section provides some background on these methods. Specific details on how we extend and apply these methods in our experiments can be found in sections 3.2 and 3.5.
P(S; LM) = ∏_{i=1}^{n} p(t_i | C(t_i))    (1)

H̄(S; LM) = −(1/∥S∥) ∑_{i=1}^{n} log₂ p(t_i | C(t_i))    (2)
Originally proposed by Shannon (Shannon, 1948), who later used it to predict the next letter in a sequence of English (Shannon, 1951), entropy models the amount of information conveyed by a message. That is, if the message were to be translated to binary, what is the fewest number of bits required to encode it in the language model? The fewer the bits needed to encode the message, the less information (and thus the more repetitive/predictable) the message. In the context of language models, entropy indicates how unexpected a token is, and acts as a measure of how successful the language model is in capturing the underlying relevant features that characterize the grammar, vocabulary usage, and ideas of the text.
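As a small illustrative sketch (not the paper's code), the per-token entropy of Eq. (2) is simply the average number of bits a model needs per token, given the probabilities it assigns:

```python
# Average per-token entropy (bits): the mean negative log2-probability
# the language model assigns to each token in a held-out sequence.
import math

def avg_entropy(token_probs):
    """token_probs: p(t_i | context) for each token in the test sequence."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# A model that finds every token fairly unsurprising (p = 0.5) needs 1 bit/token...
print(avg_entropy([0.5] * 8))    # 1.0
# ...while more surprising tokens cost more bits.
print(avg_entropy([0.125] * 8))  # 3.0
```

Lower average entropy thus means a more predictable (more repetitive) text under that model, which is the sense in which Java scores lower than English in Fig 1.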
N-gram models are the simplest: here, the context C(ti) is equivalent to the past n tokens in the sequence. For example, the probability of a sentence in a trigram model would be:
P(S) = ∏_{i=3}^{n} p(t_i | t_{i−2}, t_{i−1})    (3)
p(t_i | t_{i−2}, t_{i−1}) = count(t_{i−2} t_{i−1} t_i) / count(t_{i−2} t_{i−1})    (4)
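A minimal trigram model along these lines can be sketched as follows. This is an unsmoothed maximum-likelihood toy for illustration only, not the paper's actual implementation (a practical model would smooth the counts to handle unseen trigrams):

```python
# Toy trigram language model: probabilities are raw count ratios,
# i.e. count(t-2, t-1, t) / count(t-2, t-1), with no smoothing.
from collections import Counter

def train_trigram(tokens):
    tri, bi = Counter(), Counter()
    for i in range(2, len(tokens)):
        tri[tuple(tokens[i-2:i+1])] += 1  # trigram ending at position i
        bi[tuple(tokens[i-2:i])] += 1     # its two-token context
    def p(t2, t1, t):
        return tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    return p

corpus = "for ( int i = 0 ; i < n ; i ++ )".split()
p = train_trigram(corpus)
print(p("(", "int", "i"))   # 1.0: "( int" is always followed by "i"
print(p(";", "i", "<"))     # 0.5: "; i" continues with "<" in one of two cases
```

Even on this tiny snippet, the repetitive punctuation-heavy structure of code gives many deterministic continuations, which is what drives its low entropy under n-gram models.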
Finally, we also use Long Short Term Memory networks, or LSTMs (Hochreiter and Schmidhuber, 1997). Unlike traditional feedforward models, these recurrent neural networks (RNNs) allow models to leverage variable-length contexts (Mikolov et al, 2010). LSTMs are RNNs with the ability to choose to remember some of the prior elements of the sequence5. This "selective memory" allows LSTMs to learn longer contexts than the fixed ngram models.
LSTMs and RNNs have been applied to both natural (Mikolov et al, 2010; Sundermeyer et al, 2012) and programming languages (White et al, 2015; Khanh Dam et al, 2016). We include LSTMs to compare and contrast their ability to learn natural and programming languages, but also to leverage their greater context when modeling our linearized parse trees. Much larger ngram models are needed to capture the text of these trees, but the selective learning of the LSTM is better able to capture the repetition in them. We provide more details on these in sections 3.4 and 3.5.
formula indicates a power-law relationship between the rank of a word and its frequency. By rank, we mean that the most frequent word receives rank 1 (or 0), the next most frequent rank 2, and so on. The frequency of a word then follows roughly this formula:
f ≈ C / r^α    (5)
Here, f represents the frequency of the word, C is a constant, r is the word rank, and α is the power used to fit the line (originally observed as being close to 1) to the data. This law was improved slightly to better fit very frequent and very rare words by Mandelbrot soon after (Mandelbrot, 1953). He proposed an additional constant b, which was better able to account for high-frequency words in natural language texts:
f ≈ C / (r + b)^α    (6)
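Under these definitions, the exponent α in Eq. (5) can be estimated by fitting a line to log f versus log r. The sketch below is an assumed workflow (not the paper's code), run on a small synthetic corpus built so that frequencies follow f ≈ 64/r, i.e. α = 1:

```python
# Estimate the Zipf exponent alpha by least-squares on log(frequency)
# vs. log(rank); on a log-log plot the slope of f = C / r^alpha is -alpha.
import math
from collections import Counter

def zipf_alpha(tokens):
    freqs = sorted(Counter(tokens).values(), reverse=True)  # rank 1 = most frequent
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic corpus: word ranked r appears about 64/r times (alpha = 1).
tokens = []
for rank, count in enumerate([64, 32, 21, 16, 13, 11, 9, 8], start=1):
    tokens += [f"w{rank}"] * count
print(round(zipf_alpha(tokens), 2))  # 1.0 (the corpus was built with alpha = 1)
```

A steeper fitted slope (larger α) indicates that the most frequent items dominate more of the text, i.e. a more repetitive corpus.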
Language | # of Tokens | # of Unique Tokens | Projects |
---|---|---|---|
Java | 16797357 | 283255 | 12 |
Haskell | 19113708 | 473065 | 100 |
Ruby | 17187917 | 862575 | 15 |
Clojure | 12553943 | 563610 | 561 |
C | 14172588 | 306901 | 10 |
3 Materials and methods
3.1 Data
We thus use an automated process that focuses first on collecting a sufficient amount of data, but still applies some constraints to filter out less meaningful projects and to avoid projects that share code. First, we use GHTorrent (Gousios and Spinellis, 2012) to obtain a list of all
7 Not hosted on GitHub, but selected for significance within the Java community.
[Table 2: # of Tokens and # of Unique Tokens for each natural language corpus — NASA, US Code, Commit Messages, Scifi, Shakespeare, Recipes, and English as a Foreign Language]
non-forked projects in the language on GitHub, and select those with over 100 commits. Any project whose name directly contains the name of another project on the list is removed. We then parsed the git logs to verify the GHTorrent results and removed any projects under the commit threshold or with only one contributor.
We drew on a variety of natural language corpora to capture general characteristics of writing, those specific to writing produced by English language learners, and the differences in English technical and non-technical language. We will describe each corpus in turn below; summaries of all English (and other natural language) corpora are located in Table 2.
The question of technical and imperative language is also confounded with the possibility of restricted domain. Therefore we selected six corpora: three technical corpora, two non-technical corpora with potentially restricted domain, and a corpus of instructions in the form of cooking recipes. The two non-technical corpora came from literature: a corpus of Shakespeare's works (Norvig, 2009), restricted in domain by having the same author, and a corpus compiled from 20 classic science fiction novels from the Gutenberg corpus11, which all fall under the same literary genre.
For the technical and imperative language corpora, we selected a corpus of NASA directives, a corpus of legal language, a corpus of commit messages, and a corpus of cooking recipes. The NASA directives were scraped from the NASA website. Directives share similarities with source code requirement documents, a written English equivalent to source code. Source code requirements explain in detail what is expected from a software application
For our corpus of commit messages, we began from a sample of 200 of the top 900 most starred GitHub projects, coming from a dataset mined for a study by Kavaler et al. (Kavaler et al, 2017). Initial exploration into this corpus led us to observe the frequent presence of URLs along with some automatically generated segments. To normalize these commit messages, we replaced URLs with a special tag, and then removed all lines starting with "git-svn-id", as they represented a highly repetitive automated pattern not representative of real programmer-written English. We then took a random sample of these commit messages to obtain a corpus of roughly equivalent size to all of our other specialized English corpora.
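The normalization described above can be sketched as follows; the exact URL pattern and placeholder tag (`<URL>`) are assumptions for illustration, not necessarily the paper's actual patterns:

```python
# Normalize a commit message: replace URLs with a placeholder tag and
# drop automatically generated "git-svn-id" lines.
import re

URL_RE = re.compile(r"https?://\S+")

def normalize_commit(message):
    kept = [line for line in message.splitlines()
            if not line.lstrip().startswith("git-svn-id")]
    return URL_RE.sub("<URL>", "\n".join(kept))

msg = """Fix null check in parser
See https://example.com/issue/42 for details
git-svn-id: https://svn.example.com/repo/trunk@123 abc-def"""
print(normalize_commit(msg))
# Fix null check in parser
# See <URL> for details
```

Stripping the machine-generated lines matters because their extreme regularity would otherwise inflate the measured repetitiveness of human-written commit English.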
Legal language, like code, tends to be prescriptive and precise. Just as code variables and functions regularly reference other parts of the code, so too do references within legal text. For this purpose, we downloaded the US Legal Code12. The US legal code consists of 54 major title sections relating to the general permanent federal law of the United States.
Table 3 Summary of corpora token counts and vocabulary for the modified English and Java parse trees
Java Trees | English Trees | |
---|---|---|
All Tokens | 11267469 | 11354764 |
Terminal Tokens | 2191014 | 1740902 |
Simplified Non-Terminal Vocabulary Size | 81 | 93 |
The redundancy of corpora can also be modeled using a variant of the Zipf plot (Zipf, 1949). In a standard Zipf plot, we count all occurrences of a word in a text and assign each word a rank based on frequency. The x-axis is the rank of the word, and the y-axis is its frequency. When plotted in log scale, this relationship appears roughly linear. We modify this plot in two ways. First, we normalize the frequencies on the y-axis to a percentage to make different corpora more comparable. Second, we extend the idea of a Zipf plot beyond merely individual word frequencies to word sequence frequencies. Counting bigrams, trigrams, or higher order ngrams helps make the distribution of phrase usage more apparent. In more repetitive texts the most frequent phrases constitute proportionately more of the text. On a log-log plot, we can visualize this effect (roughly) as the power law slope of the data. More repetitive texts begin higher on the y-axis and descend more steeply. Once normalized, corpora with steeper slopes demonstrate a greater frequency of repetitive phrase use; those with shallower slopes show greater innovation.
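The data behind such a modified Zipf plot can be sketched as follows (illustrative only): count n-grams, rank them by frequency, and normalize the counts to percentages of all n-grams in the corpus so that differently sized corpora line up on the y-axis:

```python
# Build (rank, percentage) points for a normalized n-gram Zipf plot.
from collections import Counter

def ngram_zipf_points(tokens, n):
    grams = Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    # Most frequent n-gram gets rank 1; counts become percentages.
    return [(rank, 100.0 * count / total)
            for rank, count in enumerate(sorted(grams.values(), reverse=True), 1)]

tokens = "the cat sat on the mat and the cat ran".split()
print(ngram_zipf_points(tokens, 2)[:3])
# the top-ranked bigram ("the cat") covers roughly 22% of all bigrams here
```

Plotting these points on log-log axes (e.g., with matplotlib) gives the slope comparison described above: the steeper the descent from rank 1, the more the corpus reuses its most common phrases.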
Using Zipf plots to assess corpus repetition averts some of the threats from using language models. To use an LSTM, the vocabulary size must be limited by removing infrequent words, which would artificially affect results for these words. There is no such limitation in the Zipf plots, which increases the robustness of the overall observations.
... InputStream in FileInputStream file ByteArrayOutputStream out ByteArrayOutputStream byte buf byte 8192 ...
... Now 175 staging centers volunteers coordinating get vote efforts said Obama Georgia spokeswoman Caroline Adelman ...
3.4 Creating Equivalent Parse Trees in Java and English
While the syntax of Java and English strings can be unambiguously represented with a tree data structure, the trees themselves are quite different. First, the Java grammar is explicitly defined, whereas English grammar is at best an imprecise model of an evolving reality. Second, Java parse trees are abstract, and omit some tokens present in the original text: punctuation (e.g., "{ }", ";", ".", "+", "-") and some reserved keywords. In contrast, the constituency parse trees of English are concrete, comprising all tokens in the original text. Thus, the vocabulary size differences could confound the interpretation of comparisons of repetition: lower vocabulary, more chance of repetition. Finally, the syntax trees in Java and English represent different granularities. In Java, a complete AST describes an entire file; in English, the tree describes a sentence. Thus, the code ASTs both encompass more tokens and have longer paths from the root to the leaves.
[Figure: an English constituency parse tree for the fragment "Soon the office work claimed all her ..." (with non-terminals such as RB, DT, NN, PDT, PRP$) shown beside a simplified Java parse tree for "public enum JavaDocOutputLevel { VERBOSE ...", labeled with node types such as #EnumDeclaration, #Modifier, #PT, #SimpleName, and #ECD]