Why Data Collection Relies on Speed

May 14, 2020 9 minute read

When collecting data, the only effective and robust method is to use machine learning. Having researchers search for keywords alone is inefficient. We’ve gone into detail on that here, but there are other reasons why machine learning is so important for data collection.

Collecting Data

Collecting data across adverse media, sanctions, PEPs and RCAs is not an easy task and can be made far more difficult when it’s human-led. People can only work a limited number of hours per day, have biases on what’s relevant and are affected by global events such as suddenly being forced into remote working (or in some cases, not being able to work remotely due to security measures, or other technical constraints).

Moving compliance to remote working has created difficulty for some parts of the industry. Due to various concerns, some researchers have had to be furloughed, which is especially impactful when it comes to data collection. Some tools that rely on researchers to build their data have seen a 200% increase in false positives where providers have been unable to rely on machine learning.

Data collection with machine learning is a proven way to do it effectively and to ensure business continuity during times of crisis.

Keyword Filtering Alone

[Figure: keyword filtering diagram]

It can be difficult to visualize why keywords are ineffective on their own. When onboarding clients you need to be able to refer to an adverse media database that has no glaring errors or gaps.

Using keyword filtering means that only a select number of media sources can be analyzed (if you don’t want to be completely overwhelmed with results). Not all adverse media from these selected sources will include the risk keywords that compliance teams are searching for, so some of it will slip through the gaps.

Once this limited set of media has been keyword filtered you now have adverse media results containing some true adverse media and some non-adverse media. As a compliance team, it’s akin to having a chainlink fence around your business – the only option for feasibly surrounding the sheer amount of data to be processed but easily penetrated.

For the adverse media that isn’t included in the selected sources to begin with, you won’t be able to detect it at all. You won’t even know that it exists. You’ve just left an open door for those entities to enter your business.
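The two failure modes described in this section can be made concrete with a small sketch. The keyword list and headlines below are invented purely for illustration and are not drawn from any real dataset. The sketch shows one harmless article being flagged (a false positive) and one genuinely adverse article passing unnoticed because it happens not to use a listed keyword.

```python
# Hypothetical risk keywords and headlines, invented purely for illustration.
RISK_KEYWORDS = {"fraud", "laundering", "bribery"}

ARTICLES = [
    # (headline, is_truly_adverse)
    ("Executive charged with fraud over fake invoices", True),
    ("Bank launches new fraud awareness campaign", False),
    ("Director accused of siphoning company funds offshore", True),
]


def keyword_flag(headline: str) -> bool:
    """Flag a headline if any of its words is a risk keyword."""
    words = {w.strip(".,").lower() for w in headline.split()}
    return bool(words & RISK_KEYWORDS)


def evaluate(articles):
    """Return the false positives and the adverse articles that were missed."""
    false_positives = [h for h, adverse in articles if keyword_flag(h) and not adverse]
    missed = [h for h, adverse in articles if not keyword_flag(h) and adverse]
    return false_positives, missed


false_positives, missed = evaluate(ARTICLES)
print("False positives:", false_positives)
print("Missed:", missed)
```

Adding more keywords would catch the missed headline, but it would also flag more harmless articles, which is the trade-off that makes keywords weak on their own.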

Keywords and Analysts

[Figure: keyword filtering with analysts]

The next step up from keyword filtering alone is to add analysts to the mix. Analysts are able to take the information at the chainlink fence stage and filter the data to help reduce the false positives, allowing a compliance team to more easily prevent these entities from being onboarded.

Unfortunately, this approach has its limitations. Analysts, by virtue of being human, cannot work endlessly and bring their own biases to filtering. They may miscategorize some articles as adverse or not adverse, and the entities in articles incorrectly deemed not adverse can then enter your business without further review.

Again, because only a limited selection of media sources is being analyzed, some adverse media will simply not be included in the analysis. And although involving analysts produces higher-quality data than keyword filtering alone, fewer articles are examined. As a result, more entities are missed and able to enter your business without being noticed.

When attempting to create profiles for these entities, the tendency is to create duplicates. This is because a human would need to analyze all existing profiles in the database in order to decide where to add data. An automated process, by contrast, can easily analyze all new data in the context of existing data, without the human limits on how much can be handled at a time.
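To illustrate why checking new data against every existing profile is easy for software, here is a minimal sketch. It assumes a deliberately simple matching rule (a normalized name plus a birth year), which is a stand-in chosen for illustration and not a description of any production matching method. The `ProfileStore` class, `normalize_name` helper and sample names are all hypothetical.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(sorted(cleaned.lower().split()))


class ProfileStore:
    """Keeps one profile per (normalized name, birth year) key."""

    def __init__(self):
        self._by_key = {}

    def upsert(self, name: str, birth_year: int, article: str) -> dict:
        # Every new record is looked up against all existing profiles at once,
        # so a spelling variant lands on the existing profile, not a new one.
        key = (normalize_name(name), birth_year)
        profile = self._by_key.setdefault(
            key, {"names": set(), "birth_year": birth_year, "articles": []}
        )
        profile["names"].add(name)
        profile["articles"].append(article)
        return profile

    def __len__(self) -> int:
        return len(self._by_key)


store = ProfileStore()
store.upsert("José Pérez", 1970, "article-1")
store.upsert("Perez, Jose", 1970, "article-2")   # same person, different spelling
store.upsert("Jose Perez", 1985, "article-3")    # different birth year, kept apart
print(len(store))  # 2
```

Exact key matching like this is far cruder than real name matching, but it shows the principle: the lookup cost does not grow with analyst workload.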

On top of these issues, if analysts are unable to work for any reason then the efficiency of the system plummets. The result is a slew of false positives, forcing compliance teams to choose between missing risk by being less rigorous, or damaging the customer experience by maintaining rigor but massively increasing onboarding times.

Machine Learning

[Figure: data collection with machine learning]

Machine learning has a clear superiority over the previous two methods. Using machine learning to create an adverse media database and filter entities means that we are not required to limit the number of sources we analyze based on the number of researchers we have. Our limit is reality – the actual media produced on any given day by all qualified media sources. There is no need to select just a portion of media to examine, all of it is available and under review.

The lack of a human element at this stage also means that adverse media can be identified constantly and without interruption, regardless of events in the world.

Once the adverse media is identified, using machine learning which understands the true context of the article (and not just the presence of keywords), the information is extracted and assigned to profiles matching real world entities without creating duplicates (which would waste time for compliance analysts). Then, when those entities attempt to onboard at your business, they’re easily identified and rejected as necessary by compliance teams. It’s the equivalent of building a brick wall around your business protecting it from adverse elements.

Creating an Effective Adverse Media Database

Taking advantage of machine learning is necessary for data collection for many reasons. It creates an effective adverse media database by examining all the media available, it works constantly, it identifies the adverse media that matters, and it presents that media in a format that’s easily accessible and can be used in conjunction with our advanced name matching and personal identifier filtering.

Machine learning provides entity profiles that can rapidly be used to decide whether or not to onboard a customer according to the risk appetite of the individual business. Without using a machine learning-based solution, your business will never be able to identify all of the potential customers who pose a threat to your compliance obligations.

But there are other reasons for using machine learning-based data collection. It’s not just useful for adverse media products. Sanctions and PEPs information changes regularly and businesses need to make sure that their data is up to date to avoid a breach or overly risky decisions.

Sanctions Data

Sanctions data sees a constant stream of change. It’s a critical data input for screening that operates at a binary level: if an entity is sanctioned then the business cannot work with it.

However, despite this level of importance, there is no unified structure on how sanctions data is delivered to the businesses that need to be aware of it. Lists can be changed without notice, designations are at times assigned in an unstructured manner and amendments can be similarly chaotic in delivery.

The machine learning-powered data collection we use for collating and ordering sanctions data is more effective than any manual scraping of data. But when we use it in conjunction with manual review the two work together to create the most accurate and complete updates to the sanctions dataset at an unprecedented speed.

Manual Monitoring – Fingerprints

Fingerprints is a machine learning capability we use to monitor manual sanctions sources. It scans the source every couple of hours and sends a notification to trigger an update whenever something is altered on the website.
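The general idea of detecting that a source has been altered can be sketched with a plain content hash. This is a swapped-in, much simpler technique used only for illustration. It says nothing about how Fingerprints itself works, and the `SourceMonitor` class and source identifiers are hypothetical. Fetching the page is left out so the example stays self-contained.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that changes whenever the content changes."""
    return hashlib.sha256(content).hexdigest()


class SourceMonitor:
    """Remembers the last digest per source and reports when it differs."""

    def __init__(self):
        self._last_seen = {}

    def check(self, source_id: str, content: bytes) -> bool:
        """Return True if the content differs from the previous check."""
        current = fingerprint(content)
        previous = self._last_seen.get(source_id)
        self._last_seen[source_id] = current
        # The first time a source is seen there is nothing to compare against.
        return previous is not None and previous != current


monitor = SourceMonitor()
print(monitor.check("list-a", b"Entity One\nEntity Two"))                # False
print(monitor.check("list-a", b"Entity One\nEntity Two"))                # False
print(monitor.check("list-a", b"Entity One\nEntity Two\nEntity Three"))  # True
```

In a scheduled job, a `True` result would be the point at which a notification is sent to trigger an update. A raw hash also fires on cosmetic page changes, which is one reason a real monitor needs to be smarter than this.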

Manual Review Sanctions Collection

Once Fingerprints has triggered the update, we verify that only correct data in the correct structure is drawn into the production of the sanctions dataset through a proprietary manual review tool.

This allows us to identify, track, prompt and log the review activity. Using this process, we have total control over the sanctions data before it reaches production, which allows us to identify mistakes made by regulators if needed and prevent delays on manual sanctions source updates.

We can also prevent and control the deletion of sanctioned entities and update data in a timely fashion for all clients through manual monitoring. And our Fingerprints tool allows us to support multiple websites and sources.
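As a rough illustration of a review step that sits between a detected change and production, the sketch below holds every proposed change until a reviewer decides on it and logs each step. It is a simplified assumption of how such a gate could look, not a description of the proprietary tool. `ReviewGate`, `ProposedChange` and the sample entity are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProposedChange:
    entity_id: str
    action: str  # "add", "update" or "delete"
    payload: dict = field(default_factory=dict)


class ReviewGate:
    """Holds proposed changes until a reviewer approves or rejects them."""

    def __init__(self):
        self.production = {}   # entity_id -> payload visible to clients
        self.pending = []      # changes awaiting manual review
        self.audit_log = []    # every proposal and decision, in order

    def propose(self, change: ProposedChange) -> None:
        self.pending.append(change)
        self._log("proposed", change, reviewer=None)

    def review(self, change: ProposedChange, reviewer: str, approve: bool) -> None:
        self.pending.remove(change)
        if approve and change.action == "delete":
            self.production.pop(change.entity_id, None)
        elif approve:
            self.production[change.entity_id] = change.payload
        self._log("approved" if approve else "rejected", change, reviewer)

    def _log(self, event: str, change: ProposedChange, reviewer) -> None:
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "entity_id": change.entity_id,
            "action": change.action,
            "reviewer": reviewer,
        })


gate = ReviewGate()
addition = ProposedChange("E1", "add", {"name": "Example Entity"})
gate.propose(addition)
gate.review(addition, reviewer="analyst-1", approve=True)

# A deletion picked up from a source is held until someone confirms it.
deletion = ProposedChange("E1", "delete")
gate.propose(deletion)
gate.review(deletion, reviewer="analyst-2", approve=False)

print("E1" in gate.production)                        # True
print([entry["event"] for entry in gate.audit_log])
```

Because the deletion was rejected, the entity stays in production, and the log records who made each decision.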

Data collection is defined by its relationship to machine learning. Human intervention is a necessary step for ensuring its quality in some niche cases, such as sanctions data, but without machine learning to monitor and react to data changes, no human-led research can compare with the speed, breadth, depth and accuracy of machine learning data collection.

Ever-changing PEPs

Politically exposed persons (PEPs) are constantly changing: positions and ranks shift, and new players enter the field. It makes for a data collection aspect of the industry that moves as fast as politics and is in frequent need of updates.

Critical PEP coverage is key to every financial services business. Using machine learning, we are able to obtain the information for all 245 countries and jurisdictions for critical class 1 PEPs across the executive, legislative and judiciary branches of power as well as the relevant heads of military, police and fire services plus the members of bank boards.

This in itself is not where the value of machine learning comes in. The true value is in the speed of updates.

Machine learning allows us to monitor these positions continuously and be aware of changes as they happen. We also perform an automatic update every 30 days to ensure that nothing is missed. This means that unannounced changes are discovered immediately and in case of significant changes, such as an upset election, we’re able to have new and updated PEP data coverage in a matter of hours after the results have been confirmed.
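The combination of continuous monitoring and a 30-day automatic update can be expressed as a simple rule: refresh a position as soon as a change is detected, and otherwise refresh it once 30 days have passed. The sketch below is a minimal illustration of that rule only. The function name and the way a change is signalled are assumptions made for the example.

```python
from datetime import date, timedelta

# The 30-day interval comes from the text above; everything else is illustrative.
FULL_REFRESH_INTERVAL = timedelta(days=30)


def needs_refresh(last_refreshed: date, today: date, change_detected: bool) -> bool:
    """Refresh immediately on a detected change, or when the interval has elapsed."""
    return change_detected or (today - last_refreshed) >= FULL_REFRESH_INTERVAL


today = date(2020, 5, 14)
print(needs_refresh(date(2020, 5, 10), today, change_detected=False))  # False
print(needs_refresh(date(2020, 5, 10), today, change_detected=True))   # True
print(needs_refresh(date(2020, 4, 1), today, change_detected=False))   # True
```

The scheduled refresh acts as a safety net: even if a change is never detected by monitoring, no record goes more than 30 days without being rechecked.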

When it comes to PEP coverage in particular, speed is king. Move too slowly and you’re using out-of-date data and may accidentally onboard a client who’s too high risk for your risk-based approach.

Using up-to-date data is key to understanding your client effectively and knowing when you need to subject them to enhanced due diligence. Without it, you may spend days frantically reviewing their transactions to ensure there’s been no suspicious activity, or have to file retroactive suspicious activity reports, all of which diverts your compliance officers from the work they need to focus on.

The data collection your vendor does matters because if you’re working from bad data, it doesn’t matter how good your internal processes are. You’ll be starting from the wrong place and will never be able to comply effectively with your regulatory obligations.