The Language of Programming: On the Vocabulary of Names

Nitsan Amit*, Dror G. Feitelson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Most of the text in a computer program is composed of the names of variables and functions. These names are selected by one developer, and need to be understood by others. This is similar to the role of words written in natural language. But there are several marked differences between the names in a program and the words in a book. First, names are frequently composed of multiple existing words, in an attempt to capture nuanced meanings and intents. Second, because of the use of multiple words, names can be rather long. Third, conventions may also allow names to be very short, and many single-letter names are used. But despite these differences, the general statistics of names are rather similar to the statistics of words. Like words, the distribution of names is close to a Zipf distribution. Also, popular names tend to be shorter than rarely used names. However, the underlying vocabulary is different. The composition of words leads to a more diverse vocabulary that can grow without bounds. But if we look at the individual words used in compound names, we find a rather limited vocabulary. These properties help explain the predictability of software, and how it can coincide with the large variability of names. They also suggest that it may be beneficial to model programs at the level of individual words rather than at the level of source code tokens.
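
As an illustration of the distinction the abstract draws between whole names and the individual words inside them, the following is a minimal sketch, not the authors' method. The splitting rule (underscores, digits, and camelCase boundaries), the function names `split_name` and `rank_frequency`, and the sample list `names` are all assumptions made for this example.

```python
import re
from collections import Counter


def split_name(name):
    """Split an identifier into lowercase words.

    Assumed rule: break on underscores, digits, and other non-letters,
    then on camelCase boundaries, keeping acronym runs such as "HTTP" whole.
    """
    words = []
    for part in re.split(r"[_\W\d]+", name):
        # Acronym run not followed by a lowercase letter, or an optionally
        # capitalized run of lowercase letters.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [w.lower() for w in words]


def rank_frequency(counts):
    """Return (rank, item, frequency) triples, most frequent first.

    Under a Zipf-like distribution, frequency falls off roughly in
    proportion to 1 / rank.
    """
    return [(rank, item, freq)
            for rank, (item, freq) in enumerate(counts.most_common(), start=1)]


# Hypothetical sample of names as they might occur in a small program.
names = ["i", "i", "i", "userName", "user_name", "getUserName",
         "userCount", "max_user_count", "count"]

# Vocabulary at the level of whole names.
name_counts = Counter(names)

# Vocabulary at the level of the words that make up those names.
word_counts = Counter(w for n in names for w in split_name(n))

for rank, word, freq in rank_frequency(word_counts):
    print(rank, word, freq)
```

On a real code base the two counters would be built from every identifier occurrence; the comparison of interest is how the number of distinct names keeps growing while the number of distinct words stays comparatively small.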

Original language: English
Title of host publication: Proceedings - 2022 29th Asia-Pacific Software Engineering Conference, APSEC 2022
Publisher: IEEE Computer Society
Pages: 21-30
Number of pages: 10
ISBN (Electronic): 9781665455374
State: Published - 2022
Event: 29th Asia-Pacific Software Engineering Conference, APSEC 2022 - Virtual, Online, Japan
Duration: 6 Dec 2022 to 9 Dec 2022

Publication series

Name: Proceedings - Asia-Pacific Software Engineering Conference, APSEC
Volume: 2022-December
ISSN (Print): 1530-1362

Conference

Conference: 29th Asia-Pacific Software Engineering Conference, APSEC 2022
Country/Territory: Japan
City: Virtual, Online
Period: 6/12/22 to 9/12/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Keywords

  • program lexicon
  • variable name
  • words distribution
