
Issues in Implementing DataGYM

Application Rollout

To deploy DataGYM successfully, it is important to list the business requirements first. DataGYM can then be configured against those requirements while it is being implemented. It is also possible to install DataGYM in its default version and modify its features later to achieve the desired results. Sometimes this is preferable, especially when a business process is being re-engineered and the exact final process and requirements are not yet known. In other words, DataGYM may be preconfigured to support a desired solution, or it can be made to evolve into the required form later.

While our roots in Financial Services have resulted in a specific slant in our solution choice sets, the design of DataGYM allows us to configure it for any market or industry focus. The following sections explain a sequence of processes and tasks, each of which may be individually configured to achieve the overall desired result.

Data Cleanup and Transformation

Getting rid of the Garbage:

Data, in order to be processed correctly, needs to be cleaned of all the Junk or Garbage that crept in during Data Collection. The presence of such Garbage not only prevents the underlying Data from being parsed correctly but also degrades the quality of Matching.

Some typical examples of the presence of Garbage in Data -


1. Name Data: MR.. ALF@ANSO GONSALVES

The two period symbols after MR and the '@' in the name are unwanted and are considered Garbage. The above Data should be cleaned to: MR ALFANSO GONSALVES in order to be parsed properly.

2. Address Data: 24! MAIN APOLL(O AVE

Here, the characters '!' and '(' are unwanted and need to be cleaned up. The above Data needs to be transformed to: 24 MAIN APOLLO AVE in order to be parsed properly.

Handling Data in mixed case:

Data constituted of characters in both Upper and Lower Case is not ideal for matching. The entire Data needs to be transformed to either Upper Case or Lower Case.

A typical example of Data in mixed case -

1. Name Data: Mr. Shanaz Alam

The above Data needs to be transformed to: MR SHANAZ ALAM

Numbers in a string:

Numeric characters are an essential part of Address Data. Components like Street Number or Apartment Number are generally expressed in numbers.

Consider the following example:

Address Data: 12/1 MAIN ST APT23

In this case, it is required to insert a space between 'APT' and '23' in order to have this Address parsed properly.

We have several User Functions for the necessary Data Cleanup and Transformation.
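As an illustration of what such a User Function might do (a sketch under assumed cleanup rules, not DataGYM's actual code), the three transformations above can be combined into one function; the garbage character set and regular expressions here are illustrative assumptions:

```python
import re

def clean_record(text):
    """Apply the three cleanup steps described above to one field."""
    # 1. Strip garbage characters that crept in during Data Collection.
    #    (Illustrative rule: keep only letters, digits, '/' and spaces.)
    text = re.sub(r"[^A-Za-z0-9/ ]", "", text)
    # 2. Normalize mixed case to Upper Case.
    text = text.upper()
    # 3. Insert a space between a letter and a following digit
    #    (e.g. 'APT23' -> 'APT 23') so the parser can separate them.
    text = re.sub(r"(?<=[A-Z])(?=\d)", " ", text)
    # Collapse any doubled spaces produced by the steps above.
    return re.sub(r" +", " ", text).strip()
```

For example, `clean_record("24! MAIN APOLL(O AVE")` yields `24 MAIN APOLLO AVE`, and `clean_record("12/1 MAIN ST APT23")` yields `12/1 MAIN ST APT 23`.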


Data Parsing:

DataGYM can parse a string of data as it appears in a field of a physical record. It uses a human-like process of deduction: from what is known about the data, or about substrings within it, it determines what the unknown pieces are. Over time, a set of intelligences is stored for future deduction. Although the approach is based on AI principles, DataGYM allows Users a finer level of control over what the machine is doing. Data can be parsed based on this type of key word definitions and their relations to each other, or based on simple patterns. The following specific manifestations of Data Parsing are provided, although many more are possible:

Name and Address Parsing

Information contained in raw data can be divided into blocks such as 'Information Related to Name' or 'Information Related to Address'. Each block, in turn, can be broken into smaller units.
For example, Name Information can be broken into Components such as Title, First Name, Middle Name, Last Name Prefix, Last Name and Last Name Suffix.

As an example to this, please consider the name: MR ALLAN S DEVIDSON SENIOR. This Name can be broken into unit components as:

Title: MR
First Name: ALLAN
Middle Name: S
Last Name: DEVIDSON
Last Name Suffix: SENIOR

Address Information, on the other hand, can be broken into: Street Number, Street Prefix, Street Name, Street Name Suffix, Street Type, Apartment, Floor, Post Box, Rural Route, City, State, Post Code etc.

As an example to the above, please consider the address: 32 BEADON ST, APT 2/1, CALCUTTA-700019
This Address can be broken into unit components as:

Street Number: 32
Street Name: BEADON
Street Type: ST
Apartment: APT 2/1
District: CALCUTTA
Post Code: 700019

The process of breaking the information into such units or components is Parsing. The parsing routine mimics the rules a human mind follows to divide information into related components. When it encounters Data to be parsed, it first tries to match each entry found in the Data against a set of standard values for the various components. It then uses a set of predefined rules to break the Data into meaningful components.

We will explain this process further using the first example above, in relation to Name Parsing:

As soon as the name 'MR ALLAN S DEVIDSON SENIOR' is encountered, it is broken into smaller components: MR, ALLAN, S, DEVIDSON, SENIOR.
All of these components are then compared with the known, standard values for the underlying Market. Assuming the name is from the USA, the process goes like this:

1> MR - Matches with the Title Table
2> ALLAN - Matches with the First Name Table
3> S - An initial that does not match any of the standard tables.
4> DEVIDSON - Does not match any of the Tables.
5> SENIOR - Matches with the Last Name Suffix Table.

Based on the above, it finds a pattern for this Name as:
Title, First Name, Initial, Unknown, Last Name Suffix

According to a predefined rule, it maps this to:
Title, First Name, Middle Name, Last Name, Last Name Suffix.

This set of predefined rules as well as the standard values varies from Market to Market and depends mainly on the customs and tradition of the underlying region.
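A toy version of this classify-then-map process, with made-up lookup tables and a single illustrative rule (DataGYM's real tables and rules are market-specific and far larger), might look like:

```python
# Tiny lookup tables standing in for the market-specific standard-value
# tables (Title, First Name, Last Name Suffix) described above.
TITLES = {"MR", "MRS", "MS", "DR"}
FIRST_NAMES = {"ALLAN", "JOHN", "SANDRA"}
SUFFIXES = {"SENIOR", "JUNIOR", "SR", "JR"}

def classify(token):
    """Match one token against the standard-value tables."""
    if token in TITLES:
        return "Title"
    if token in FIRST_NAMES:
        return "First Name"
    if token in SUFFIXES:
        return "Last Name Suffix"
    if len(token) == 1:
        return "Initial"
    return "Unknown"

# One illustrative predefined rule: observed pattern -> component labels.
RULES = {
    ("Title", "First Name", "Initial", "Unknown", "Last Name Suffix"):
        ("Title", "First Name", "Middle Name", "Last Name",
         "Last Name Suffix"),
}

def parse_name(name):
    tokens = name.split()
    pattern = tuple(classify(t) for t in tokens)
    labels = RULES.get(pattern, pattern)  # fall back to raw classification
    return dict(zip(labels, tokens))
```

With this sketch, `parse_name("MR ALLAN S DEVIDSON SENIOR")` labels DEVIDSON as the Last Name and S as the Middle Name, just as in the walkthrough above.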

The set of common Street Types for USA is not the same as in Brazil.
For Example, 'R' is a common Street Type in Brazil. However, the same word, 'R' is not a Street Type in USA.

In addition, the order in which various Name components are written in the Name Information in the USA is not the same as written in Japan.

According to the common custom in Japan, Family Name comes before the Given Names in Name Information. However, in the USA, it is the other way round.

The Parsing engine of DataGYM is flexible enough that the same engine can be used to parse Information for any block (Address, Name, Company, House etc.), provided we have the set of standard values for its different components and the predefined rules mentioned above.


Genderization - Assigning Gender Codes

Genderization is the Process of assigning a Gender Code to a Record by studying the corresponding Name Data.

It works in the same fashion by which one can tell that a person named 'BRIAN LARA' is most likely Male.
Such conclusions obviously depend on the underlying market. To understand this, consider the following Names and the corresponding Genders:

The First Name 'BRIAN' generally corresponds to a Male.

A Name with the Title 'MRS.' must correspond to a Female.

A Name with the Last Name Suffix 'SR.' must correspond to a Male.

For some Names, however, the Gender cannot be determined, because none of the components of the Name is conclusive.

From the above examples, it can be seen that some components of a Name can at times be used to determine the Gender. It also becomes clear in what order these fields should be checked. DataGYM combines this mental process into its Genderization routine. It stores commonly used Name Components along with their plausible Gender codes. During the Genderization Process, DataGYM looks up these stored values for matches with any component of the Name and, if a match is found, assigns the corresponding Gender Code to the Record.

To its advantage, DataGYM has accumulated more than 25,000 common First Names for the US and Brazil Markets.
In our earlier experiments with these Markets, DataGYM was able to Genderize most of the records in most client files.
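A minimal sketch of such a lookup-based Genderization routine, with tiny made-up tables and an assumed checking order (Title, then Last Name Suffix, then First Name):

```python
# Hypothetical stored values: each table maps a component value to a
# Gender code. Tables are checked in order; the first conclusive hit wins.
GENDER_TABLES = [
    ("Title", {"MR": "M", "MRS": "F", "MS": "F"}),
    ("Last Name Suffix", {"SR": "M", "JR": "M",
                          "SENIOR": "M", "JUNIOR": "M"}),
    ("First Name", {"BRIAN": "M", "SANDRA": "F"}),
]

def genderize(parsed_name):
    """Assign a Gender Code to a parsed Name record, or 'U' if unknown."""
    for component, table in GENDER_TABLES:
        value = parsed_name.get(component, "").upper().rstrip(".")
        code = table.get(value)
        if code:
            return code
    return "U"  # no component of the Name was conclusive
```

So `genderize({"Title": "MRS."})` yields `F`, while a name whose components all miss the tables comes back as `U` (undetermined), mirroring the examples above.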


Floating - Rearranging Misplaced Data

Raw Data, at times, shows a peculiar trend of displaying the information of one field in the placeholder of another field.
Hence, a Company Name may be found in place of a Name or Address, or vice versa. DataGYM handles such issues with its Floating Routine. The Floating Routine can also identify and move some components from the Data to the corresponding field, and can thus substitute for the Parsing Routine to some extent.

The Floating Routine uses standard, known values of various fields and looks for them in the Data. If found, it picks up the relevant information and puts it in the appropriate field. The Floating Routine, basically, rearranges the data.

Example 1:

Name: DAVID & CO
Address: 12 MAIN RD, FLOOR 2, APT 22

Floating Routine will rearrange this as:
Company: DAVID & CO
Address: 12 MAIN RD, FLOOR 2, APT 22

Example 2:


Floating Routine will rearrange this as:
Job Title: CEO

Example 3:

Address Line 2: 1119 BLACK CANYON HWY

Floating Routine will rearrange this as:
Address: 1119 BLACK CANYON HWY

Example 4:

Address Line 1: TOTAL ALCHEMY
Address Line 2: MRS. SANDRA O'LEARY
Address Line 3: 12 OCEAN AVENUE

Floating Routine will rearrange this as:
Company: TOTAL ALCHEMY
Name: MRS. SANDRA O'LEARY
Address: 12 OCEAN AVENUE
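A highly simplified sketch of one Floating step, with made-up marker lists for company names and job titles (the real routine draws on much larger tables of known field values):

```python
# Illustrative marker tables; these values are assumptions, not
# DataGYM's actual standard-value tables.
COMPANY_MARKERS = {"CO", "CORP", "INC", "LTD", "LLC"}
JOB_TITLES = {"CEO", "CFO", "PRESIDENT"}

def float_fields(record):
    """Move values that clearly belong to another field, as in the
    examples above, leaving everything else in place."""
    out = dict(record)
    name = out.get("Name", "")
    tokens = name.replace(".", "").upper().split()
    if any(t in COMPANY_MARKERS for t in tokens):
        # A 'Name' containing a company marker floats to Company.
        out["Company"] = name
        out.pop("Name")
    elif name.upper() in JOB_TITLES:
        # A 'Name' that is actually a job title floats to Job Title.
        out["Job Title"] = name
        out.pop("Name")
    return out
```

Applied to Example 1, `float_fields({"Name": "DAVID & CO", "Address": "12 MAIN RD, FLOOR 2, APT 22"})` moves DAVID & CO into the Company field and leaves the Address untouched.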

Successful handling of most languages:

DataGYM is designed internally to handle any data, including double-byte character sets such as Chinese, Korean and Japanese Kanji. The intelligence and market-specific transformations are stored external to the software, making it easy for DataGYM to take any established process and transport it to work effectively in any other market.

HouseHolding - Finding Relationships

HouseHolding means matching various records and grouping the matched records together. In other words, HouseHolding brings out (or at least tries to bring out) the relationships that exist among different records.

A record may contain several pieces of information, such as Name, Company, Address etc. Accordingly, one can find a match between two records corresponding to two persons working in the same company, even if the two records have entirely different Address or Name Information.

Consider the following examples.

Example 1:

Record1: Home Phone: 6742814536
Office Phone: 6742937061

Record2: Home Phone: 6742937061
Office Phone: 6742814536

Record3: Home Phone: 6742814536
Office Phone: Unknown

Note that the above three records probably correspond to the same person.

Example 2:

Record1: Name: JOHN PAUL ADAMS

Record2: Name: DAVE ADAMS
Address: 12 MUCIPAL DR., STE 2 1, RICHARDSON, TEXAS - 750804444

Note that the above records correspond to two persons who probably belong to the same family and live at the same address.

Example 3:

Record1: Name: PAUL CONNOLY
Office Phone: 7192356067
Company Address: 22 WEST END PKWY, SUITE 224

Record2: Name: JAMES P ORLANDO
Office Phone: 7192356067
Company Address: 22 WEST END PARKWAY, UNIT 224

Record3: Company: AARON CORP.
Office Phone: Unknown
Company Address: 22 WESTEND, STE 224

Here, the above three records correspond to three different persons, probably working in the same Company.

From the above three examples, it becomes clear that various types of relations may exist among the records.

Some of these relations may be of interest for a particular application. DataGYM uses the concept of Groups to hold the HouseHolding Information for different types of relations separately. Each Group, in turn, is constituted of one or more matching rules, which are selected by taking into consideration the detailed business requirements and the traditions prevalent in the underlying market.
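One common way to implement this kind of grouping (a sketch of the general technique, not necessarily DataGYM's method) is union-find over pairwise matching rules: any rule that fires for a pair of records links them into the same Group. The `same_family` rule below is a hypothetical example:

```python
from itertools import combinations

def household(records, rules):
    """Group records: any matching rule that fires for a pair links
    them into the same group (union-find with path halving)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if any(rule(records[i], records[j]) for rule in rules):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return list(groups.values())

# One hypothetical matching rule: same Last Name and same Post Code
# suggests members of the same family (cf. Example 2 above).
same_family = lambda a, b: (a["Last Name"] == b["Last Name"]
                            and a["Post Code"] == b["Post Code"])
```

Because groups are built from the transitive closure of pairwise matches, two records that never match each other directly can still land in one Group through a shared third record.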

Instant Matching - Rapid Identification

Problems related to Instant Matching fall into one of two categories. In the traditional way, a list containing enough information about a set of prospects is matched against a known set of records. The nature of the known set of records depends on the application.

For example, if the list contains information on persons applying for a loan, the known set of records might be information on known defaulters. The intention behind matching the two sets is to reject an applicant whose information matches one or more of the defaulters. This known set of records is called the 'vertical list', while the set of applicants is known as the 'horizontal list' (or simply, the list).

DataGYM handles such problems using its Instant Match Routine.

Now, think of a scenario where loan applications are received over the WWW. Compiling and processing a list only after a sufficient number of such records has been received may prove too costly in the era of rapid customer service. Therefore, the decision has to be taken as soon as each record (application) is received.

The Instant Match Engine of DataGYM has the unique capability of handling such cases with ease by incorporating the concepts of Messaging. The only precondition for such a decision-making process is that DataGYM must be provided with the vertical list (and the related settings) before processing begins.
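A minimal sketch of this message-driven flow: the vertical list is indexed once up front, then each incoming application is decided on immediately. The class name, key fields and exact-key matching used here are illustrative assumptions, not DataGYM's API:

```python
class InstantMatcher:
    """Match each incoming record against a preloaded vertical list."""

    def __init__(self, vertical_list, key_fields):
        # Precondition from the text: the vertical list (e.g. known
        # defaulters) must be provided before processing begins.
        self.key_fields = key_fields
        self.index = {self._key(r) for r in vertical_list}

    def _key(self, record):
        return tuple(record.get(f, "").upper() for f in self.key_fields)

    def check(self, application):
        """Decide on one incoming record as soon as it arrives."""
        return "REJECT" if self._key(application) in self.index else "ACCEPT"
```

In a real deployment the `check` call would sit behind a messaging endpoint, so each application received from the WWW gets an answer without waiting for a batch to accumulate.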

Consolidation of Key Information into a Single Record

DataGYM environment saves all data from different data sources input to the process, Ultimately User want to combine the data from these various sources into an information rich record more effective for making business decisions and for establishing communication strategies.

DataGYM can physically combine data from all sources and produce one physical record from a group of related records. If data is not owned in the Organization, DataGYM can create a composite view of the record so no permanent change is executed on the record but an Organization can take full advantage of the combined data from different sources in its decisions.


Mainframe Data, MQ Messaging and Other Environments

Raw data is fed into the client database by external applications as and when needed. It is the task of DataGym to Consolidate and Household those data and return the clean output data to the original client database.

The whole process has three parts.

  • Importing data from CLIENTDB to DGDR
  • Consolidating and Householding the data through DataGym
  • Exporting the data from DGDR to CLIENTDB

Importing Data

DataGym can handle data from almost any kind of database. As a one-time setting, a DataFile Object needs to be defined through the DataGym Front End, pointing to the CLIENTDB. After that, whenever data needs to be brought into DGDR, that DataFile Object is imported. Depending on the requirement, one performs either a full file import or a partial data import. Everything can be accomplished through the user-friendly Front End of DataGym.

Manipulating Data

Once we have our DataFile Object imported, it is a matter of one or two button clicks to Consolidate and Household the data. Again, everything is done through the DataGym GUI.

Exporting Data from DGDR to CLIENTDB

The “Export” functionality of DataGym can export data from DGDR to an external source in two forms. It can either export data directly to a client database, or it can generate an XML file from the DGDR. In either case, there are options to export all the data corresponding to a DataFile Object, or only fresh/partial data. According to client needs, the export can be triggered in two ways.

  • From the DataGym GUI - export data, specifying the source, destination, and range of data.

  • Call the DataGym API from any external program.

The remainder of the discussion concentrates on the DataGym API and how to invoke it to achieve the desired result.

DataGym Export API

The export API of DataGym is a set of Java classes in which the core logic is embedded. Client programs need not deal with the internal logic and data intricacies of DGDR; they just need to incorporate the set of Export APIs. Any software capable of accessing Java classes can utilize these APIs. Setting the appropriate export parameters and then calling the required function is enough to export data from the DataGym environment. The settings determine what type of data to export, whether to export all the data, and the range of data (if a partial export is needed). Two functions are of main importance to the developer.


Reports and Quality Assurance

Various QA (Quality Assurance) features are integrated into DataGym.

To this end, DataGym uses one of its most powerful features, viz. Reports. Besides the standard text-based reporting format, DataGym also integrates reports in MS Excel format. DataGym lets the user keep a close watch on the Data from Import to the very end, by allowing detailed Reports to be generated after each step. We have a set of Guidelines that describes how to verify and crosscheck the quality of the output and the related parameter settings, to quickly identify and solve any problem that may occur during the processing of a file.

As part of the Quality Procedure, the User can set pre-conditions to reject records at the Consolidation Level.

For example, you may reject records with blank First Names. DataGym gives the unique opportunity to re-execute the processes on the rejected records with comparatively loose parameter settings, and thereby include some of them for further processing. For example, you may, at a later date, decide to include those rejected records whose Middle Names are not blank.
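The reject-then-reprocess idea can be sketched as follows; the strict and loose rules are the illustrative examples from the text, not built-in DataGym settings:

```python
def split_rejects(records, reject_rule):
    """Apply a rejection pre-condition at the Consolidation level,
    returning (accepted, rejected) record lists."""
    accepted = [r for r in records if not reject_rule(r)]
    rejected = [r for r in records if reject_rule(r)]
    return accepted, rejected

# Strict rule for the first pass: reject any blank First Name.
strict = lambda r: not r.get("First Name")

# Looser rule for a later re-run over the rejects: still reject only
# if the Middle Name is blank as well.
loose = lambda r: not r.get("First Name") and not r.get("Middle Name")
```

Running `split_rejects` with `strict`, then re-running it over the rejected records with `loose`, recovers exactly those blank-First-Name records that have a Middle Name, matching the scenario above.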


RDBMS Environment

DataGym is committed to performing precisely to meet the heavy demands of its users. It has been built with extensive dynamism to accommodate even the smallest differences in user needs, while remaining remarkably light on resources. The infrastructure and resources required for a comfortable and successful implementation of DataGym depend on the volume of data to be processed, the Data Availability requirement, the processing speed needed and, of course, a balanced economy. As a set of tools, DataGym leverages its internal architecture and optimized performance engine in coordination with world-leading RDBMSs, bringing investment and performance into balance. It relies heavily on its own internal design and its close synchronization with the internals and advantageous features of the underlying RDBMS and Hardware Platform.

Planning for the right RDBMS and operating environment may start from a single-processor desktop and scale up to a heavily clustered, multi-CPU environment in a distributed database architecture, meeting precise Data Warehousing needs. DataGym meets this requirement of scalability and can operate under almost all leading Operating System environments. The main factors in planning the right server configuration are the intended volume of data to be processed through DataGym, the processing speed needed, the intended Up-time for making data available, the distributed operating environment of the users, and any multi-threaded, parallel file-processing needs.

The dynamic functional design and its transformation into an RDBMS environment were carried out with great importance given to making DataGym stand on a Flexible Design, ready to be implemented on almost all leading RDBMSs. This design has been successfully implemented on the Oracle platform and is capable of being transformed into a fully functional RDBMS environment on platforms like SQL Server, DB2, Sybase, etc.



Copyright © 2019 CIANT® Corporation • All Rights Reserved Worldwide.
All brands, trademarks, cases and articles are the property of their respective owners.