

project

As the name DataMiningGrid suggests, the project aims to integrate two technologies: data mining and grid computing.

  • Data mining

An ever-increasing amount of digital data generated in modern computing environments demands efficient management and use of stored data, and in particular the transformation of these data into information and knowledge. Data mining (also known as knowledge discovery in databases) is the technology that addresses this need. Data mining technology is used for the automatic, nontrivial extraction of implicit, previously unknown, and potentially useful information from data. In its typical form, it can be viewed as the formulation, analysis, and implementation of an induction process (proceeding from specific data to general patterns) that facilitates the extraction of information from data. Data mining technology is now recognized as a key computational technology, supporting traditional tasks such as analysis, visualization, design, and simulation. It has important applications in diverse sectors such as science, engineering, education, business, government, and manufacturing. The various data mining techniques differ in terms of the:

  • Types of information that are extracted (e.g. predictive models, association patterns, cause-effect relationships, detection of affinity similarity-based groupings, deviation detection)
  • Format of the induced information (e.g. rules, decision trees, correlation networks, association patterns, neural networks, matrices, visualization)
  • Types of data they operate on (e.g. digital images, text, discrete, continuous, sequence, temporal), and
  • Application domain for which they are developed (e.g. finance, engineering, science, life science, manufacturing, marketing)

Data mining brings together researchers from many disciplines, including statistics, machine learning, visualization and image processing, mathematics, database technology, software engineering, and others. Work in data mining ranges from highly theoretical mathematical work in areas like statistics, machine learning, knowledge representation, and algorithms to the development of system solutions for problems like fraud detection, cancer modelling, network intrusion, and information retrieval on the web. Increasingly, data mining is employed in classical scientific discovery disciplines, such as biological, chemical, physical, and social research, and in a variety of other knowledge industries, including government, education, high-tech engineering, and process automation. Thus, data mining technology will play an important role in structuring and shaping future knowledge-based sectors in Europe.

 

  • Grid computing

Today, modern science, education, medicine, engineering, telecommunication, and classical sectors such as marketing, retail, finance, manufacturing, and government are characterized by rising degrees of coordinated resource sharing among dynamic and geographically distributed individuals, institutions, and resources. Grid computing promises to become an essential technology capable of addressing the changing computing requirements of future distributed environments.

This technology can be viewed as a generic enabling technology for distributed computing in the Internet era. It is based on a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to computing resources anywhere and anytime. In their basic form, these resources provide raw compute power (CPU cycles) and massive storage capacity (magnetic disk or other mass storage devices). These two Grid dimensions were originally dubbed Computational-Grid and Data-Grid, respectively. However, since the inception of Grid technology, the term resource has evolved to cover a wider spectrum of concepts, including “physical resources (computation, communication, storage), informational resources (databases, archives, instruments), individuals (people and the expertise they represent), capabilities (software packages, brokering and scheduling services) and frameworks for access and control of these resources (OGSA - Open Grid Services Architecture, The Semantic Web)”. By using a Grid to share resources, researchers and small enterprises can gain access to resources they could not otherwise afford. Research institutes, on the other hand, can leverage their investment in research facilities by making them available to many more scientists.

Initially, the Grid research community focused on Computational-Grids, which have today reached maturity. Toolkits such as Globus, Condor and AVAKI offer a wide range of services, including job management, data transfer (for input and output), and various security primitives such as authorization and secure channels. Built on top of Globus services, a number of Data-Grid projects were initiated about three years ago. The most notable of these is the EU-funded DataGrid project (initiated 2001). The DataGrid today spans 40 different sites and is divided into three application domains: biomedical, particle physics, and earth observation.

Data-Grid systems today offer good solutions for file-based data access and data management problems: they allow a client to locate and select data, and they take care of data replication, among other tasks. Thus, the functionality of a Data-Grid is similar to that of information retrieval systems. However, many scientific and commercial applications are highly dependent on data stored in more complex database management systems, which provide more sophisticated access to and processing of data. Therefore, recent research has been focusing on Grid database access and integration services. These developments, together with the ever-increasing need to exploit the growing amounts of data across many sectors, are now giving rise to the development of generic Grid infrastructure (protocols, services, systems, tools) that facilitates the automated analysis and interpretation of large and inherently distributed data.

 

Updated on January 12, 2005