

project

  • DataMiningGrid objectives and vision

The shift towards intrinsically distributed, complex problem-solving environments is prompting a range of new data mining research and development problems. These can be classified into the following broad challenges:

  • Distributed data: The data to be mined is stored in distributed computing environments on heterogeneous platforms. Consequently, development of algorithms, tools, and services is required that facilitate the mining of distributed data.
  • Distributed operations: In the future, more and more data mining operations and algorithms will be available on the grid. To facilitate seamless integration of these resources into distributed data mining systems for complex problem solving, novel algorithms, tools, grid services, and other IT infrastructure need to be developed.
  • Massive data: Development of algorithms for mining large, massive and high-dimensional data sets (out-of-memory, parallel, and distributed algorithms) is needed.
  • Complex data types: Increasingly complex data sources, structures, and types (such as natural language text, images, time series, and multi-relational and object data types) are emerging. Grid-enabled mining of such data will require the development of new methodologies, algorithms, tools, and grid services.
  • Data privacy, security, and governance: Automated data mining in distributed environments raises serious issues in terms of data privacy, security, and governance. Grid-based data mining technology will need to address these issues.
  • User-friendliness: Ultimately, data mining in a distributed grid computing environment must hide technological grid details from the user. To facilitate this, new software, tools, and infrastructure need to be developed in the areas of grid-supported workflow management; resource identification, allocation, and scheduling; and user interfaces.

To address the challenges listed above, the following main objectives of the DataMiningGrid project were identified:

  • To develop grid interfaces that allow data mining tools and data sources to interoperate within distributed grid computing environments,
  • To develop grid-based text mining and ontology-learning services and interfaces,
  • To develop a testbed consisting of several demonstrator applications from a diverse set of sectors, including the bioinformatics, healthcare, and automotive industries, and
  • To align and integrate these technologies with emerging grid standards and infrastructures.

The project is structured into nine workpackages (WPs). WP1 to WP5 are concerned with the development of data mining technology for grid computing environments, and WP6 is designed to demonstrate the developed technology on a selected and representative set of demonstrator applications from different sectors. The three remaining workpackages (WP7, WP8, and WP9) are concerned with concertation, dissemination, awareness, exploitation, and project management activities. The following figure gives an overview of the various technological elements of the DataMiningGrid and the project’s workpackages.

Figure: work packages

The picture depicts the different methodology, technology, and systems components (boxes) and the data flow among them. Each solid-lined box represents a distinct WP. The middle layer of the diagram (between the horizontal dashed lines) shows the grid technology to be developed. It comprises the core grid data mining services (DataMiningGrid Data and Analysis Services, including ontologies and text mining) and the DataMiningGrid Workflow Management and Middleware components. The bulk of the work will be devoted to the corresponding WPs (WP2-WP5).

The application layer at the top of the diagram represents WP6. This WP will focus on a selected and representative set of applications from a diverse range of domains and sectors, including computational biology and bioinformatics, medicine, civil engineering and ecological modelling, a scientific digital library, text mining, and complex monitoring problems.

 

Updated on January 14, 2005