Challenges posed by distributed data mining
The shift towards intrinsically distributed, complex problem-solving environments is prompting a range of new data mining research and development problems. These can be classified into the following broad challenges:
- Distributed data: The data to be mined is stored in distributed computing environments on heterogeneous platforms. For both technical and organizational reasons, it is impossible to bring all the data to a centralized place. Consequently, algorithms, tools, and services are needed that facilitate the mining of distributed data.
- Distributed operations: In the future, more and more data mining operations and algorithms will be available on the grid. To facilitate seamless integration of these resources into distributed data mining systems for complex problem solving, novel algorithms, tools, grid services, and other IT infrastructure need to be developed.
- Massive data: Development of algorithms for mining large, massive and high-dimensional data sets (out-of-memory, parallel, and distributed algorithms) is needed.
- Complex data types: Increasingly complex data sources, structures, and types (such as natural language text, images, time series, and multi-relational and object data types) are emerging. Grid-enabled mining of such data will require the development of new methodologies, algorithms, tools, and grid services.
- Data privacy, security, and governance: Automated data mining in distributed environments raises serious issues in terms of data privacy, security, and governance. Grid-based data mining technology will need to address these issues.
- User-friendliness: Ultimately a system must hide technological complexity from the user. To facilitate this, new software, tools, and infrastructure development is needed in the areas of grid-supported workflow management, resource identification, allocation, and scheduling, and user interfaces.
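The "distributed data" and "massive data" challenges can be illustrated with a minimal Python sketch. It is not part of the DataMiningGrid software; the names `LocalSummary`, `summarize`, and `combine` are hypothetical and chosen only for illustration. The sketch assumes each site can compute a small summary of its own data in a single streaming pass, so that a coordinator can derive a global mean and variance without any raw records leaving their site.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalSummary:
    """Sufficient statistics computed where the data lives (hypothetical type)."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> LocalSummary:
    """Runs at one site; streams the values, so they need not fit in memory."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return LocalSummary(count, total, total_sq)


def combine(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Runs at the coordinator; sees only the summaries, never the records."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    mean = sum(s.total for s in summaries) / n
    # Population variance via E[x^2] - (E[x])^2.
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    return mean, variance


# Two sites summarize their own data; only the summaries are shared.
site_a = summarize([1.0, 2.0, 3.0])
site_b = summarize([4.0, 5.0])
mean, variance = combine([site_a, site_b])
```

The sum-of-squares formula is used here for brevity; it can lose precision on large or widely ranging values, so a real system would likely prefer a numerically stabler merge. Sharing summaries rather than records also only partly addresses the privacy challenge listed above, since summaries can still leak information about small data sets.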
To address the challenges listed above, new technology and services are being developed. The DataMiningGrid project focuses mainly on developing the components and services that belong to the DataMiningGrid Technology Layer (between the horizontal dashed lines in the picture below). Analysis Services and Data Services are being developed that will make it possible to use different data mining algorithms (located in the layer below them) in a distributed environment. A Workflow Editor is also being developed to ease the hard task of integrating and configuring complex analysis tasks.
The application layer at the top of the stack represents a set of applications from a diverse range of domains and sectors that will be used to demonstrate the benefits of the developed technology.
Updated on June 3, 2008