1. Introduction
Reliability is the basic requirement of any form of storage. In the context of data storage, reliability means preserving data in the form in which the subscriber originally stored it. Loss of integrity of stored data amounts to data corruption and hence to low reliability. Corruption may be malicious or inadvertent. Separate mechanisms exist in the domain of information security to deal with malicious corruption. However, even in the absence of malicious attacks, data are likely to become corrupted over time. Error correction codes are therefore used to recover from such corruption and to maintain high reliability. These codes occupy considerable storage, however, and thus contribute to storage overhead; a naive error correction mechanism requires an excessive amount of storage just to hold the codes. Efficient error correction codes are therefore needed to achieve high reliability at low storage cost.
In distributed storage systems, reliable low-cost storage can be achieved not only through clever error correction codes but also by exploiting other parameters unique to such systems. Distributed storage requires data to be stored on a set of storage nodes distributed across a network. Such systems offer additional options, such as changing the allocation pattern across nodes or repairing a fault locally or globally. These options can be exploited intelligently to improve reliability without incurring excessive storage cost. Distributed storage allocation is one such option: it may ensure high reliability while minimizing storage cost.
The cost factor in distributed storage allocation is defined in terms of the total storage budget. This definition rests on the core assumption that the communication links between nodes can support the amount of data held on the respective storage nodes. Where link capabilities are insufficient, the total storage budget is limited by, and defined in terms of, the communication link capacity. A limitation of existing work is that it considers only one of two equally important cost factors, namely storage capacity and link capacity. Ideally, a storage network administrator would like to minimize storage and communication costs together, where possible. An optimization problem that treats both as costs and works toward their joint minimization would be novel.
To the best of our knowledge, no work has studied network performance while varying the storage allocation pattern and spread. Our intuition is that network performance varies noticeably with changes in the allocation pattern and spread. This paper presents a simulation-based study that explores network performance under varying allocation patterns and spreads, and its results validate this intuition. Motivated by these results, we propose a data placement algorithm that allocates data to distributed storage nodes so as to optimize both storage and link capacity costs.
Based on the above discussion, the objectives of the paper may be stated as follows:
- Study network performance through simulation by varying the allocation pattern. We verify whether allocating an equal amount of data to each node of the distributed system (symmetric allocation) gives better performance than allocating unequal amounts of data to the nodes (non-symmetric allocation). We also investigate whether allocating data to a few nodes (minimal spread) is better in terms of network performance than allocating data to the maximum number of nodes (maximal spread);
- Formulate an optimization problem that jointly optimizes the essential cost factors, namely required storage and link capacity, while maintaining high reliability;
- Propose an intelligent data placement algorithm that allocates data to storage nodes such that minimum storage space and link capacity are required, while maintaining high reliability.
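The allocation terms used in the objectives above (symmetric versus non-symmetric, minimal versus maximal spread) can be illustrated with a small sketch. It is not the paper's simulation or its placement algorithm. It assumes a simplified model that the text does not specify: a data object of unit size, a total storage budget expressed as a multiple of that size, nodes that are reachable independently with probability `p_access`, and recovery that succeeds whenever the reachable nodes together hold at least one object's worth of data. The function names and all numeric values are hypothetical.

```python
from itertools import product


def symmetric_allocation(n_nodes, budget, spread):
    """Split `budget` equally over `spread` nodes; remaining nodes store nothing."""
    if not 1 <= spread <= n_nodes:
        raise ValueError("spread must be between 1 and n_nodes")
    share = budget / spread
    return [share] * spread + [0.0] * (n_nodes - spread)


def recovery_probability(allocation, p_access, object_size=1.0):
    """Probability that the reachable nodes together hold at least `object_size`.

    Assumption (not from the paper): each node is reachable independently with
    probability `p_access`, and any reachable data totalling `object_size`
    suffices for recovery. Enumerates all 2^n subsets, so keep n small.
    """
    total = 0.0
    for reachable in product((False, True), repeat=len(allocation)):
        prob = 1.0
        stored = 0.0
        for amount, up in zip(allocation, reachable):
            prob *= p_access if up else 1.0 - p_access
            if up:
                stored += amount
        if stored >= object_size - 1e-9:  # tolerance for float rounding
            total += prob
    return total


if __name__ == "__main__":
    n_nodes, budget = 6, 2.0  # hypothetical: 6 nodes, budget = 2x object size
    patterns = {
        "symmetric, minimal spread": symmetric_allocation(n_nodes, budget, 2),
        "symmetric, maximal spread": symmetric_allocation(n_nodes, budget, n_nodes),
        "non-symmetric": [1.0, 0.5, 0.25, 0.25, 0.0, 0.0],
    }
    for p_access in (0.3, 0.9):
        for name, alloc in patterns.items():
            print(f"p={p_access} {name}: {recovery_probability(alloc, p_access):.4f}")
```

Under these assumptions, minimal spread gives the higher recovery probability when nodes are rarely reachable (`p_access = 0.3`), while maximal spread does better when they are usually reachable (`p_access = 0.9`). The sketch covers reliability only; it does not model link capacity or network performance.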