A data grid is an architecture or set of services that gives individuals or groups of users the ability to access, modify and transfer extremely large amounts of geographically distributed data for research purposes. Data grids make this possible through a host of middleware applications and services that pull together data and resources from multiple administrative domains and then present it to users upon request. The data in a data grid can be located at a single site or multiple sites where each site can be its own administrative domain governed by a set of security restrictions as to who may access the data. Likewise, multiple replicas of the data may be distributed throughout the grid outside their original administrative domain and the security restrictions placed on the original data for who may access it must be equally applied to the replicas. Specifically developed data grid middleware is what handles the integration between users and the data they request by controlling access while making it available as efficiently as possible. The diagram to the right depicts a high level view of a data grid.
Middleware provides all the services and applications necessary for efficient management of datasets and files within the data grid while providing users quick access to the datasets and files. There are a number of concepts and tools that must be available to make a data grid operationally viable. However, at the same time not all data grids require the same capabilities and services because of differences in access requirements, security and location of resources in comparison to users. In any case, most data grids will have similar middleware services that provide for a universal name space, data transport service, data access service, data replication and resource management service. When taken together, they are key to the data grids functional capabilities.
Since sources of data within the data grid will consist of data from multiple separate systems and networks using different file naming conventions, it would be difficult for a user to locate data within the data grid and know they retrieved what they needed based solely on existing physical file names (PFNs). A universal or unified name space makes it possible to create logical file names (LFNs) that can be referenced within the data grid that map to PFNs. When an LFN is requested or queried, all matching PFNs are returned to include possible replicas of the requested data. The end user can then choose from the returned results the most appropriate replica to use. This service is usually provided as part of a management system known as a Storage Resource Broker (SRB). Information about the locations of files and mappings between the LFNs and PFNs may be stored in a metadata or replica catalogue. The replica catalogue would contain information about LFNs that map to multiple replica PFNs.