Introduction
Software architecture is defined as the organization of a system, embodied in its components and their relationships. As software systems grow in size and complexity, it becomes hard for developers to keep the architecture well documented, and the implemented architecture drifts away from its initial design. Most open-source projects lack architectural documentation, and for these projects the code itself serves as the available documentation. Software architecture recovery is therefore crucial for several reasons: to adapt a software system to changing requirements, to enable the reuse of components, and to estimate the cost and risks involved in a change.
For this reason, extensive research has been carried out on recovering the architecture of software systems. Architecture recovery is defined as a reverse engineering approach that aims at reconstructing the architecture from the implementation view of the software. Many techniques have been proposed, and they work on different types of input information. Depending on the input information used, these techniques are categorized as structure-based, semantic-based, or knowledge-based (Kong et al., 2018). Structure-based techniques rely on the structure of the source code to extract relations, and they group software elements by their structural dependencies using different clustering techniques. Semantic-based techniques rely on the textual information present in source code and documentation; they try to form topics and group software elements under these topics. Knowledge-based techniques use various types of input information from software repositories, namely framework-related information, directory information, patterns, commits, and issues in version control systems.
In the literature, the majority of architecture recovery techniques are either structure-based (Mancoridis et al., 1999) (Maqbool & Babri, 2004) (Andritsos & Tzerpos, 2005) (Wang et al., 2010) (Zhang et al., 2010) (Cho et al., 2019) or semantic-based (Kuhn et al., 2007) (Garcia et al., 2011) (Sajnani, 2012) (Link et al., 2019). Only a few techniques (Li et al., 2017) (Shahbazian et al., 2018) (Kong et al., 2018) (Guimaraes & Cai, 2020) exploit the knowledge available in software repositories for architecture recovery. Directory information is readily available knowledge in any software system, yet very few techniques (Kong et al., 2018) use it in architecture recovery. Most techniques use one or two types of input information in the recovery process, and none of them utilizes structural, semantic, and directory information at the same time. Further, there is no proper study on how to extract the available directory knowledge and integrate it with structural and semantic information for architecture recovery.
This paper aims to mine all needed semantic information, compute hierarchy-based directory dependency information, and integrate both with structural dependencies to recover the software architecture. Semantic information, including comments, identifiers, variables, and class/method names as well as their usage, is mined, and a new approach for extracting directory dependencies based on the directory hierarchy is proposed. Various coupling schemes are formulated to evaluate the effect of using multiple dependencies in architecture recovery. These coupling schemes are also evaluated with different sets of weights on three subject systems to identify the best combination of weights for integrating the dependencies. The main contributions of this paper include:
1. Designing a new approach for computing directory dependencies from the directory hierarchy by using distance-based and depth-based measures.
2. Mining all types of semantic information and empirically evaluating the effect of using semantic and directory dependencies in architecture recovery by formulating six different dependency coupling schemes.
3. Integrating all three dependencies using the best combination of weights identified through experimentation.
4. Studying the effect of the integrated dependencies in architecture recovery by using complete linkage clustering.
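To make the contributions above more concrete, the following is a minimal Python sketch of the overall idea: a directory similarity derived from the directory hierarchy, a weighted combination of structural, semantic, and directory scores, and complete linkage clustering over the combined scores. It is only an illustration under stated assumptions. The similarity formula (a Wu-Palmer-style ratio of common-ancestor depth to total depth), the weights, the merge threshold, the file paths, and the structural and semantic scores are all hypothetical and are not taken from the paper, whose actual measures are not given in this section.

```python
from itertools import combinations
from pathlib import PurePosixPath


def directory_similarity(path_a: str, path_b: str) -> float:
    """Assumed depth-based measure: 2 * depth(common dir) / (depth(a) + depth(b))."""
    dirs_a = PurePosixPath(path_a).parent.parts
    dirs_b = PurePosixPath(path_b).parent.parts
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    total = len(dirs_a) + len(dirs_b)
    # Two files that both sit in the root directory are treated as fully similar.
    return 2 * common / total if total else 1.0


def combine(structural, semantic, directory, weights):
    """Weighted sum of the three scores; missing pairs count as 0."""
    w_struct, w_sem, w_dir = weights
    return {
        pair: w_struct * structural.get(pair, 0.0)
        + w_sem * semantic.get(pair, 0.0)
        + w_dir * directory[pair]
        for pair in directory
    }


def complete_linkage(items, similarity, threshold):
    """Agglomerative clustering; cluster similarity is the minimum pairwise score."""
    clusters = [frozenset([item]) for item in items]

    def link(c1, c2):
        return min(similarity[tuple(sorted((a, b)))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: link(*p))
        if link(c1, c2) < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical input: four source files and made-up dependency scores.
files = sorted([
    "src/net/http/Client.java",
    "src/net/http/Server.java",
    "src/ui/Window.java",
    "src/ui/Button.java",
])
structural = {
    ("src/net/http/Client.java", "src/net/http/Server.java"): 0.8,
    ("src/ui/Button.java", "src/ui/Window.java"): 0.7,
}
semantic = {
    ("src/net/http/Client.java", "src/net/http/Server.java"): 0.6,
    ("src/ui/Button.java", "src/ui/Window.java"): 0.5,
}
directory = {
    (a, b): directory_similarity(a, b) for a, b in combinations(files, 2)
}

combined = combine(structural, semantic, directory, weights=(0.5, 0.3, 0.2))
clusters = complete_linkage(files, combined, threshold=0.5)
for cluster in clusters:
    print(sorted(cluster))
```

Expressing complete linkage in terms of similarity (the minimum pairwise score between two clusters) is equivalent to the usual maximum-distance formulation. In this sketch the weights are fixed by hand, whereas the paper selects them experimentally across its coupling schemes.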