Content archived on 2024-06-18

Software Engineering Properties of Functionally Enabled Languages

Final Report Summary - SEFUNC (Software Engineering Properties of Functionally Enabled Languages)

SEFUNC Final Report
Georgios Gousios

Arie van Deursen

Delft University of Technology, The Netherlands

The advent of distributed version control systems has led to a new paradigm for distributed software development: instead of pushing changes to a central repository, developers pull them from other repositories and merge them locally. Various code hosting sites, notably GitHub, have seized the opportunity to facilitate pull-based development by offering workflow support tools, such as code reviewing systems and integrated issue trackers. The SEFUNC project focused on mining and analyzing distributed collaboration on social coding sites, across functional as well as object-oriented programming paradigms.

In the context of the project, we created a large-scale repository mining operation to retrieve all data available from the GitHub hosting site. Using it, we analyzed distributed collaboration through pull-based development and visualized language ecosystems.

Large scale repository mining

A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive web service, which enables researchers to retrieve both the commits to the projects’ repositories and the events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub’s event streams and persistent data, and to offer it to the research community as a service.

The primary challenge for the data collection process is the hourly limit that GitHub imposes on authenticated requests, which is already lower than the rate at which events are generated. Given that a single event can lead to several, or even thousands of, dependent requests, it is not practical to assume that a single GitHub account will suffice to mirror the whole dataset. For this reason, GHTorrent was designed from the ground up to employ distributed data collection. The event retrieval and data retrieval processes have been split into two components connected through a queue; this way, event retrieval is isolated from data retrieval, and both can run in parallel under multiple accounts. Shared databases keep track of the data on both ends of the retrieval. To acquire more GitHub accounts, we introduced the "workers for data" program, in which interested researchers provided their account credentials to the project in exchange for direct access to the live databases.
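The split between event retrieval and data retrieval can be illustrated with a minimal sketch. It is not GHTorrent's actual implementation: the in-memory queue, the `EVENTS` list, the dictionary standing in for the shared database, and all function names are assumptions made for illustration, and no real GitHub API calls are made.

```python
import queue
import threading

# Hypothetical stand-in for the event stream; real events would come
# from the hosting site's web service.
EVENTS = [{"id": i, "type": "PushEvent"} for i in range(6)]
SENTINEL = None  # marker telling a worker to stop


def retrieve_events(event_queue, n_workers):
    """Event retrieval: enqueue every event, then one stop marker per worker."""
    for event in EVENTS:
        event_queue.put(event)
    for _ in range(n_workers):
        event_queue.put(SENTINEL)


def retrieve_data(account, event_queue, store, lock):
    """Data retrieval: one worker per account drains the shared queue."""
    while True:
        event = event_queue.get()
        if event is SENTINEL:
            break
        # A real worker would issue the dependent requests here, counted
        # against this account's own hourly request limit.
        with lock:
            store[event["id"]] = account


def mirror(accounts):
    """Run one data-retrieval worker per account and return the filled store."""
    event_queue = queue.Queue()
    store = {}  # stands in for the shared database
    lock = threading.Lock()
    workers = [
        threading.Thread(target=retrieve_data, args=(a, event_queue, store, lock))
        for a in accounts
    ]
    for worker in workers:
        worker.start()
    retrieve_events(event_queue, len(workers))
    for worker in workers:
        worker.join()
    return store
```

Calling `mirror(["account-a", "account-b"])` processes every event exactly once, with the work divided between the two hypothetical accounts. Because the producer only talks to the queue, adding accounts raises the total request budget without changing the event retrieval side.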

After collection, the data is provided back to the community through the project's web page (http://ghtorrent.org). Currently, more than 2 terabytes of data are available in two database formats. The wealth of data enables researchers, for the first time, to do full-population quantitative studies in several domains, including software ecosystems, distributed collaboration and repository mining. The project received the *Best data showcase award* at the 2013 Mining Software Repositories conference, for its innovative use of distributed crawling and the sharing of valuable data with the community.

Pull-based development

Pull-based development is an emerging paradigm for distributed software development. As more developers appreciate isolated development and branching, more projects, both closed source and, especially, open source, are being migrated to code hosting sites that support pull-based development, such as GitHub and Bitbucket. A unique characteristic of such sites is that they allow any user to clone any public repository. The clone creates a public project that belongs to the user who cloned it, so the user can modify the repository without being part of the development team. Furthermore, such sites automate the selective contribution of commits from the clone to the source through pull requests.

Pull requests as a distributed development model in general, and as implemented by GitHub in particular, form a new method for collaborating on distributed software development. The novelty lies in the decoupling of the development effort from the decision to incorporate the results of the development into the code base. By separating the concerns of building artifacts and integrating changes, work is cleanly distributed between a contributor team, which submits changes (often only occasionally) to be considered for merging, and a core team, which oversees the merge process by providing feedback, conducting tests, requesting changes, and finally accepting the contributions.

Within the context of the project, we performed the first large-scale quantitative analysis of how the pull-based development model works. Specifically, we extracted data from 300 large projects (170,000 pull requests) and, using statistical and machine learning tools, examined the factors that affect pull request acceptance or rejection and the time required to reach that decision. We found that the pull-based development model offers fast turnaround, increased opportunities for community engagement and decreased time to incorporate contributions. We showed that a relatively small number of factors affect both the decision to merge a pull request and the time to process it. Our findings contain actionable items that teams and individuals can use to improve the efficiency of distributed collaborative projects.

Visualizing language-based project ecosystems

In the context of software analysis, the term ecosystem means a collection of software systems that are developed and co-evolve in the same environment. One interesting way to partition projects into ecosystems is by programming language, which permits the visual comparison of, e.g., functional and object-oriented languages. On the collaboration level, language ecosystems are created when projects share developers. To investigate the existence and evolution of such ecosystems, we created an interactive visualization (seen in the adjacent screenshot). Using it, interested parties can browse millions of developer collaborations among projects in one or more programming languages, and also investigate how those collaborations change over time.
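The idea that projects become linked by sharing developers can be sketched as a small graph construction. This is an illustrative sketch only, not the code behind the project's visualization; the function name and the `(developer, project)` input format are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def shared_developer_edges(contributions):
    """Map each pair of projects to the set of developers active in both.

    `contributions` is an iterable of (developer, project) pairs; this
    input format is assumed for illustration.
    """
    projects_by_developer = defaultdict(set)
    for developer, project in contributions:
        projects_by_developer[developer].add(project)

    edges = defaultdict(set)
    for developer, projects in projects_by_developer.items():
        # Every pair of projects a developer touched gains a shared link.
        for a, b in combinations(sorted(projects), 2):
            edges[(a, b)].add(developer)
    return dict(edges)
```

Each key of the result is an edge between two projects, and the size of its developer set can serve as an edge weight. Restricting the input to projects of one language, or to one time window, would yield the per-language and per-period views described above.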

Impact and Availability on Web

The project has generated a stream of follow-up work. The dataset has been selected for the mining challenge of the 11th Conference on Mining Software Repositories, the flagship conference in the field. It is currently being actively analyzed by researchers in more than six universities, and papers from independent researchers have already been published. Analysis and visualization results produced through the project have been used as examples of cutting-edge research in conference keynotes.

You can find more about the project, including the datasets, links to source code, visualizations and documentation, through the project's website at http://ghtorrent.org.