January 25, 2019

Six Rules for Enterprise Data Management

Ofer HARARI

Product Manager – NLP and Knowledge Graph/Refinitiv


Most organizations face data management challenges: multiple databases with different content, duplicate content coming from various data sources, and so on.

Organizations struggle to derive value and analytics from their data, to search properly across multiple content sets, and to understand content dependencies. For example: if event "X" happened, who and what is affected, and what action should we take? Or, if our organization needs to develop a new capability, how can we find which technologies are relevant and what other departments have already developed that we might leverage?

 

Knowledge Management is the art of deriving value out of data.

 

Data management depends on technology factors (database, storage, search engine, authentication, backup/redundancy, high availability, speed, etc.). This post does not address those technology factors; it focuses on the higher-level principles of data strategy and data management. Here are six foundational aspects of any data management initiative:

1. Information Model
You need to model your data. Define the objects/entities you care about, the relationships between those entities, and any associated attributes such as topics/themes. For example, in supplier data, a supplier/vendor would be a type of entity, with relationships to industries, products, or specific geographic areas of operation.

In the auto industry, for example, Brembo is a supplier to Tesla (a supplier relationship). Brembo also has attributes (relationships) such as geography (Italy), industry (Automotive), and person (Chairman of the Board Alberto Bombassei).
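The Brembo example above can be sketched as a small set of (subject, relationship, object) triples. This is only an illustration of the idea, not a recommended storage design: the `Entity` and `Graph` classes and the relationship names (`supplier_of`, `located_in`, `in_industry`, `chaired_by`) are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node in the information model, e.g. an organization or a person."""
    name: str
    entity_type: str


@dataclass
class Graph:
    """Holds (subject, relationship, object) triples. Hypothetical helper."""
    triples: set = field(default_factory=set)

    def add(self, subject: Entity, relationship: str, obj: Entity) -> None:
        self.triples.add((subject, relationship, obj))

    def related(self, subject: Entity, relationship: str) -> list:
        """Return every object linked to `subject` by `relationship`, sorted by name."""
        return sorted(
            (o for s, r, o in self.triples if s == subject and r == relationship),
            key=lambda e: e.name,
        )


brembo = Entity("Brembo", "Organization")
tesla = Entity("Tesla", "Organization")

graph = Graph()
graph.add(brembo, "supplier_of", tesla)
graph.add(brembo, "located_in", Entity("Italy", "Geography"))
graph.add(brembo, "in_industry", Entity("Automotive", "Industry"))
graph.add(brembo, "chaired_by", Entity("Alberto Bombassei", "Person"))

print([e.name for e in graph.related(brembo, "supplier_of")])  # ['Tesla']
```

Because every relationship is just another triple, new relationship types can be added later without changing the structure, which is the appeal of the schema-less style.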

Following a well-defined ontology will reduce data silos and ensure your data is more integrated/connected.

Make sure to define relationships that link objects across content sets (e.g. a Person entity has an expertise relationship to cardiac medicine). Relationships can be defined using a scalable, dynamic, schema-less approach (Linked Data, http://linkeddata.org/) or using a more rigid schema definition.

2. Single source of truth
Similar or overlapping data can reside in different databases. Which data is correct? Which database/master to link/refer to?

If the same person name/entity appears in more than one source, and those sources/content sets are not linked, then you have a problem.

The preferred approach is to have one single people master, where every source that holds data on people refers to the same person entity in that master. Examples of multiple data sources for a person: credit card data, website visit statistics, mobile phone location data, and medical records. All of those data sets need to refer to the same person in the one people master, the single source of truth.
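As a minimal sketch of that idea, each source data set below keeps its own data but points at the person through a shared master ID rather than holding its own copy of the identity. All names, IDs, and values here (`people_master`, `P-001`, the record fields) are hypothetical and exist only for illustration.

```python
# Hypothetical people master: one record per real-world person.
people_master = {
    "P-001": {"name": "Jane Doe"},
}

# Each source carries the master ID instead of its own version of the person.
credit_card_records = [{"person_id": "P-001", "card_spend": 120.50}]
web_visit_records = [{"person_id": "P-001", "page_views": 42}]
medical_records = [{"person_id": "P-001", "last_visit": "2019-01-10"}]


def profile(person_id, *sources):
    """Combine the master record with every source record that refers to it."""
    merged = dict(people_master[person_id])
    for source in sources:
        for record in source:
            if record["person_id"] == person_id:
                # Copy the source's own fields; the link field itself is not data.
                merged.update({k: v for k, v in record.items() if k != "person_id"})
    return merged


print(profile("P-001", credit_card_records, web_visit_records, medical_records))
```

Since every source resolves to the same `person_id`, a correction to the person's identity is made once, in the master, and no source can drift out of sync with it.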

There are different methods to collect and cross-reference (concord) data from various sources:

    • Federation: this approach keeps multiple separate vertical data masters that link to one central people master. For example, a students master database and a physicians master database could contain the same person (a physician who was, or still is, a student). Both master databases should link to that person's entity record in the single central people master.
    • Matching / concordance scores: every data collection and data mastering effort involves disambiguating and matching data from different sources into the single source of truth. For example, does a record or social media profile for "Bill Gates" refer to one Bill Gates, to Bill Gates Sr., or to another Bill Gates entirely?
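The Bill Gates question above can be sketched as a simple concordance score. This is one possible approach under stated assumptions, not a production matcher: the weights (0.6 for name similarity, 0.4 for an exact birth-year match), the 0.8 threshold, the master IDs, and the helper names are all invented for illustration. The name comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def concordance_score(record, candidate):
    """Weighted score: name similarity (0.6) plus exact birth-year match (0.4)."""
    score = 0.6 * name_similarity(record["name"], candidate["name"])
    birth_year = record.get("birth_year")
    if birth_year is not None and birth_year == candidate.get("birth_year"):
        score += 0.4
    return score


def best_match(record, master, threshold=0.8):
    """Return the highest-scoring master entry, or None if it is below threshold."""
    scored = [(concordance_score(record, c), c) for c in master]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None


# Illustrative master entries; the IDs are hypothetical.
master = [
    {"id": "P-1", "name": "Bill Gates", "birth_year": 1955},
    {"id": "P-2", "name": "Bill Gates Sr.", "birth_year": 1925},
]

incoming = {"name": "bill gates", "birth_year": 1955}
print(best_match(incoming, master)["id"])  # P-1

# With only a name, no candidate is confident enough to match automatically.
print(best_match({"name": "Bill Gates"}, master))  # None
```

Returning `None` below the threshold is a deliberate choice in this sketch: an ambiguous record is better routed to manual review than silently attached to the wrong person in the master.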

3. Industry standards
When defining your information model, ontology, taxonomies, and so on, it is recommended to use industry standards as your baseline. You can then add your own specific relationship types and taxonomies on top.
Examples: in the biomedical industry, an industry taxonomy; in financial services, the FIBO ontology; in editorial journalism, a taxonomy of news themes/topics.

4. Machine-readable metadata
All data needs machine-readable metadata to make it findable, searchable, and useful. There are many technologies and workflows for applying such metadata, ranging from fully manual, through semi-automated, to fully automated. This is a practice in itself.
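To make "machine-readable" concrete, here is a small sketch in which each content item carries a metadata record that can be serialized to JSON and filtered by topic. The field names (`id`, `title`, `topics`, `source`) and the sample items are assumptions for illustration, not a standard metadata schema.

```python
import json


def build_metadata(item_id, title, topics, source):
    """Build a metadata record; topics are sorted so the output is stable."""
    return {"id": item_id, "title": title, "topics": sorted(topics), "source": source}


def find_by_topic(records, topic):
    """Return the IDs of all records tagged with `topic`."""
    return [r["id"] for r in records if topic in r["topics"]]


# Hypothetical content items.
records = [
    build_metadata("doc-1", "Brake supplier overview", {"automotive", "suppliers"}, "internal-wiki"),
    build_metadata("doc-2", "Cardiology staffing plan", {"cardiac medicine"}, "hr-share"),
]

# JSON is one common machine-readable form; any system can load it back.
serialized = json.dumps(records, indent=2)
loaded = json.loads(serialized)

print(find_by_topic(loaded, "automotive"))  # ['doc-1']
```

Whether the tags are applied by hand or by an automated classifier, the point is the same: once the metadata is structured, search and discovery become simple lookups.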

5. Content collection and automation
Collecting content, then validating, curating, and mastering the data (adding, deleting, and updating data in the single master) is all part of data management. Examples include sourcing image data and applying proper tags and descriptions to make the images searchable, or sourcing medical records from various physicians, which involves matching each record to the right patient identity and classifying the main symptoms, diagnoses, and treatments given.

6. Organizational management and the human factor
Getting your organization's buy-in is important for success. Moving toward a proper data management culture requires the participation of the entire organization and, most importantly, management backing. It requires hard decisions on organizational priorities and a long-term vision.

There are many other aspects of data management:

    • Data governance
    • A central directory of data items
    • Maintaining provenance and priority of data sources
    • Entitlements
    • User quality feedback into masters
    • Data discovery workflows
    • Intelligent search
    • Subject-matter/content experts
    • Ontologies
    • Taxonomists, and more

Data management has to start with the organization's top management and be seen as a long-term investment. It is something every organization will have to go through sooner or later to stay in business.

Comments? You can contact me directly via my ExecRanks profile.
