Software testing of citation database systems reveals unexpected failures

Academics will need to rethink how they title their research papers after a study led by the University of Wollongong (UOW) found that papers with hyphens in their titles are credited with fewer citations than those without hyphens.

The researchers used an innovative software testing method named “metamorphic robustness testing” to verify the major citation database systems Scopus and Web of Science, and found problems with the way they compute citation counts and journal impact factors.

The citation count is the number of times an academic paper is referenced in the bibliographies of other academic papers and books. Journal impact factor is the average number of times that articles published in a specific journal are cited, and is used to measure a journal’s standing in its field.

The study analysed data from 140,000 papers from Scopus and 35,000 papers from Web of Science. Scopus covers the life sciences, social sciences, physical sciences and health sciences; Web of Science includes science, social science, arts and humanities.

Thousands of research papers are published each year. Some become influential in their field, others less so, based on the number of citations they receive.

Citation counts can influence an academic’s career, including whether they gain promotions. Most university ranking systems also use citation counts as one of their measures.

The study’s lead author, Associate Professor Zhi Quan (George) Zhou from UOW’s School of Computing and Information Technology, said the surprising results were applicable to all faculties in any university.

“Our results question the common belief that citation counts are a reliable measure of the contribution and significance of papers, as they can be distorted simply by the presence of hyphens in paper titles, which has no bearing on the quality of research,” Professor Zhou said.

“They also challenge the validity of citation-based journal-level metrics, including the journal impact factors.”

Professor Zhou and his fellow researchers, UOW PhD student Matt Witheridge and Professor T.H. Tse from The University of Hong Kong, found the presence of hyphens in a title increased the possibility of it being copied inaccurately when it was cited. For example, a colon might replace a dash or hyphen, or a hyphen might be dropped from between two words.

This pointed to problems with the ability of Scopus and Web of Science to accommodate small errors in cited titles when counting citations.

“Our in-depth analysis revealed robustness defects in Scopus and Web of Science, resulting in erroneous citation statistics for papers with hyphens in the titles,” Professor Zhou said.

“Even a minor typo can cause serious citation indexing failures due to the software system’s inability to cope with data entry errors.”
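The idea of checking robustness this way can be illustrated with a small sketch. It is not the authors' procedure, whose details are not given here; it only assumes the general pattern described above: take a cited title, introduce the kind of small transcription change mentioned (a hyphen copied as another dash, replaced by a space, or dropped), and check that the system still resolves the citation to the same paper. The index, the `strict_match` function, the title and the paper identifier below are all hypothetical stand-ins for a real citation database.

```python
from typing import Callable, List, Optional

# A matcher takes a cited title and returns a paper id, or None if nothing matches.
Matcher = Callable[[str], Optional[str]]

# Hypothetical one-entry index standing in for a citation database.
TITLE = "A Graph-Based Approach to Test-Case Selection"
INDEX = {TITLE: "paper-001"}


def strict_match(cited_title: str) -> Optional[str]:
    """Toy matcher that only accepts an exact, character-for-character title."""
    return INDEX.get(cited_title)


def perturbations(title: str) -> List[str]:
    """Variants of a title with the small copying errors described in the text."""
    return [
        title.replace("-", "\u2013"),  # hyphen copied as an en dash
        title.replace("-", " "),       # hyphen replaced by a space
        title.replace("-", ""),        # hyphen dropped entirely
    ]


def violations(match: Matcher, title: str) -> List[str]:
    """Return the perturbed titles that no longer resolve to the same paper.

    Expected relation: a slightly miscopied title should match the same
    record as the correctly copied title.
    """
    expected = match(title)
    return [
        variant
        for variant in perturbations(title)
        if variant != title and match(variant) != expected
    ]


if __name__ == "__main__":
    for variant in violations(strict_match, TITLE):
        print("citation lost for:", variant)
```

No correct answer has to be known in advance for each query: the check only compares the system's output for the original title with its output for the perturbed ones, which is what makes this style of test practical on large data sets.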

Initial transcription errors are then amplified through second- and third-hand citations, when people copy inaccurate citations from another citing paper’s reference list.

The impact of hyphens in paper titles was found to occur across different research areas.

A 2015 study (Letchford et al., ‘The advantage of short paper titles’) found that papers with shorter titles were cited more than those with longer ones. Its results were widely reported internationally in such venues as Science and Nature. However, the new study shows it is the number of hyphens in the title, rather than the title length, that is the dominating factor for citation counts.

“Usually, the number of hyphens and the title length are correlated, giving the misinterpretation that the citation counts depend on the title length,” Professor Zhou said.

The research team looked at whether hyphens affected citation counts in different disciplines (e.g. biosciences, chemical sciences and computing sciences). While there was some variation in the effect, it appeared in all disciplines.

The researchers further investigated the impact of hyphens in paper titles at the journal level. A field-wide study of software engineering journals revealed that higher journal-impact-factor-ranked journals published a lower percentage of papers with hyphenated titles.

The researchers chose to focus on the hyphen because it is an ambiguous character: in a citation database or a citing article’s reference list it can represent a hyphen, a subtraction or negative sign, en dash, em dash, horizontal bar, list icon, et cetera. At the same time, the number of hyphens in a title has no bearing on the actual quality of the paper.
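To make the ambiguity concrete, the following sketch lists several distinct Unicode characters that all look like a hyphen on the page, and shows one possible way to compare titles so that they are treated alike. It is an illustration only: nothing here describes how Scopus or Web of Science actually match titles, and the names `DASH_VARIANTS`, `title_key` and `same_title` are made up for this example.

```python
import unicodedata

# Some of the characters that can appear where a hyphen was intended
# (illustrative, not exhaustive).
DASH_VARIANTS = {
    "\u002D": "hyphen-minus",
    "\u2010": "hyphen",
    "\u2011": "non-breaking hyphen",
    "\u2012": "figure dash",
    "\u2013": "en dash",
    "\u2014": "em dash",
    "\u2015": "horizontal bar",
    "\u2212": "minus sign",
}


def title_key(title: str) -> str:
    """Build a comparison key that ignores case, spacing and punctuation.

    Only letters and digits are kept, so every dash variant, a colon,
    a space, or a dropped hyphen all lead to the same key.
    """
    folded = unicodedata.normalize("NFKC", title).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def same_title(a: str, b: str) -> bool:
    """Treat two cited titles as the same paper if their keys agree."""
    return title_key(a) == title_key(b)


if __name__ == "__main__":
    original = "Fault-Tolerant Systems"
    for char, name in DASH_VARIANTS.items():
        variant = original.replace("-", char)
        print(f"{name:20} matches original: {same_title(original, variant)}")
```

A key this aggressive trades precision for tolerance: two genuinely different titles that differ only in punctuation would also be merged, so a real system would need additional fields, such as authors or year, to tell them apart.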

In addition, while authors’ names are also susceptible to typos, researchers have more freedom as to how they title their papers.

“Our research provides practical tips for authors to write citation-robust titles to avoid potential citation errors,” Professor Zhou said.

While the study focused on citation database systems, the testing method is applicable to any software system that processes large amounts of data. Successful information access in the digital age requires robust indexing and abstracting systems, yet such systems are difficult to test and verify because of the sheer volume of data they process.

The research conducted by Professor Zhou’s team has provided a practical solution to such problems.


‘Metamorphic Robustness Testing: Exposing Hidden Defects in Citation Statistics and Journal Impact Factors’, by Zhi Quan Zhou, T.H. Tse and Matt Witheridge, is published in IEEE Transactions on Software Engineering.

The study was supported in part by an Australian Research Council Linkage Grant and an Australian Government Research Training Program scholarship.