Why Our Data Is Different: Keeping Up With the Churn

November 14, 2015
by Kevin @ Etic Lab

The real world is a messy place. Information becomes out of date almost as fast as it is acquired.

Companies start up and cease trading in their hundreds. In 2013-2014 we found that 670 businesses ‘died’ every day; in fact, one in seven accommodation and food service businesses dies every year. Many thousands more companies are registered and never trade at all.

Companies can have multiple names.
Their trading address and registered address can be different.
They can have multiple trading addresses, or only a cell phone number.
They can have websites, Twitter accounts, Facebook pages and Yellow Pages listings.

Our system is designed from the ground up to deal with this reality: the modern era of almost constant change. This means that we do not set out to give a black-and-white, 100% accurate picture of the current state of affairs, not least because there is no such thing, but also because we want to deal with the nature of what is happening so that we can add value for customers seeking immediate access to useful data.

Where Does Our Data Come From?

We collect and interpret the information that is available and provide, using artificial intelligence techniques, data on trading businesses that is as consistent and reliable as we can make it. If we are not 95% confident that a business exists and is trading from where we say it is trading, we do not sell the information. It is as simple as that.
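
The 95% rule above can be pictured as a simple filter. The following is a minimal sketch, not a description of the production system: the `TradingEntity` record, its `confidence` field and the `sellable` function are hypothetical names introduced for illustration, and how the confidence score itself is computed is not covered here.

```python
from dataclasses import dataclass

# The 95% figure comes from the text; everything else here is illustrative.
CONFIDENCE_THRESHOLD = 0.95


@dataclass
class TradingEntity:
    name: str
    location: str
    # Hypothetical score in [0, 1]: how confident the system is that the
    # business exists and is trading from `location`.
    confidence: float


def sellable(entities, threshold=CONFIDENCE_THRESHOLD):
    """Keep only the entities whose confidence meets the threshold."""
    return [e for e in entities if e.confidence >= threshold]
```

Anything scoring below the threshold is simply left out of what is sold, rather than being sold with a caveat attached.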

All of our data starts life in the public domain, but we add to it in a large number of ways in order to make it accessible and useful. This means we do not buy any data such as credit ratings from third parties, nor do we telephone thousands of companies to find out if they are there that day. Given the scale of change (turnover in staff, locations, business failures and the sheer number of businesses involved), such techniques are intrusive and limited in their utility.

Data Quality

Our clients have complex data requirements, and in order to meet these expectations we had to design a sophisticated system and collect data for years before we could begin offering high quality products.

In order to be able to ‘find’ new businesses we first built the tools and processes to acquire data from a variety of sources. Once acquired, the data then needed to be processed using Big Data techniques to create a data warehouse in which all of the various types of information could be brought together in a way that allows us to ask useful questions. This core data warehouse can be used in a large number of different ways.

To give an example, NATE Quarterly (Newly Actively Trading Enterprises & Businesses in the UK) is in itself a single batch of queries, or data requests, made against that warehouse. In essence, we ask the system to list every Trading Entity for which it has acquired the required evidence of a name, activity, location and contact details within the previous 3 months.
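
As a rough illustration of that kind of query, the sketch below selects entities from a small in-memory table. It makes several assumptions that are not taken from the real warehouse: each entity is represented as a mapping from evidence type to the date the evidence was acquired, "3 months" is approximated as 90 days, and an entity qualifies only if all four pieces of evidence fall inside that window. The names `REQUIRED_EVIDENCE`, `WINDOW` and `newly_trading` are hypothetical.

```python
from datetime import date, timedelta

# The four evidence types named in the text.
REQUIRED_EVIDENCE = ("name", "activity", "location", "contact")

# Assumption: "the previous 3 months" approximated as 90 days.
WINDOW = timedelta(days=90)


def newly_trading(entities, today):
    """Return the ids of entities whose required evidence was all
    acquired within the window ending on `today`."""
    cutoff = today - WINDOW
    result = []
    for entity_id, evidence in entities.items():
        dates = [evidence.get(kind) for kind in REQUIRED_EVIDENCE]
        if all(d is not None and cutoff <= d <= today for d in dates):
            result.append(entity_id)
    return result


# Made-up sample data, for illustration only.
sample = {
    "bakery": {
        "name": date(2015, 10, 1),
        "activity": date(2015, 10, 3),
        "location": date(2015, 10, 3),
        "contact": date(2015, 10, 20),
    },
    "garage": {  # evidence is too old to count as newly trading
        "name": date(2015, 1, 5),
        "activity": date(2015, 1, 9),
        "location": date(2015, 1, 9),
        "contact": date(2015, 2, 1),
    },
    "cafe": {  # no contact details found yet
        "name": date(2015, 10, 10),
        "activity": date(2015, 10, 12),
        "location": date(2015, 10, 12),
    },
}

print(newly_trading(sample, today=date(2015, 11, 14)))  # ['bakery']
```

Only the bakery is listed: the garage's evidence predates the window, and the cafe is missing one of the four required items.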

24 hours a day | 365 days a year

The core task of acquiring, cleaning and then integrating all of the different data sources goes on 24 hours a day, 365 days a year. We monitor public datasets such as the Companies House register, UK Internet activity and social media, and we visit many thousands of business websites every day. These data are then processed using a variety of techniques and tools to create a picture of the activity of more than 3,000,000 trading entities in the UK.

We have been actively curating the Data Warehouse since August 2014. It is maintained every day by more than a dozen different programs that run continuously, checking and harvesting data.
