What can big banks do for their internal database infrastructure?

Issues that big banks face in their internal database infrastructure, with potential solutions and trade-offs

Introduction

The big banks this blog post talks about are the names you often hear, like Goldman Sachs and JP Morgan. They are representative of the complex systems that finance firms run, which are not always the most efficient. Many of these companies connect their sites over a private Multiprotocol Label Switching (MPLS) network, an internal network that cannot be reached from the public internet. The goal is maximum security with low latency.

How to reduce unnecessary duplication?

The database infrastructure is built inside these networks, and it has a redundancy problem. Usually there is a central data source, hosted on an on-prem database, that holds data everyone wants. Beyond that, each business unit (BU) usually hosts its own on-prem database at its own address. In-house engineers manage and organize the data in a way that benefits that BU and only that BU. The consequence is many replicas of the same database, each with one or two extra columns that serve a single BU. This duplication can be avoided if the BUs work together and share the data.
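One way to picture the alternative is a single shared table plus a small BU-owned extension table, combined through a view instead of a full copy. The sketch below is a minimal illustration using Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample rows are all hypothetical and not taken from any real bank.

```python
import sqlite3

# In-memory database standing in for the on-prem servers (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central data that every business unit wants (hypothetical schema).
    CREATE TABLE central_entity (
        entity_id INTEGER PRIMARY KEY,
        name      TEXT,
        country   TEXT
    );

    -- A BU stores only its one or two extra columns, keyed on entity_id.
    CREATE TABLE bu_credit_ext (
        entity_id       INTEGER PRIMARY KEY REFERENCES central_entity(entity_id),
        internal_rating TEXT
    );

    -- The BU reads through a view instead of keeping a full replica.
    CREATE VIEW bu_credit_entity AS
    SELECT c.entity_id, c.name, c.country, x.internal_rating
    FROM central_entity AS c
    LEFT JOIN bu_credit_ext AS x ON x.entity_id = c.entity_id;
""")

conn.executemany(
    "INSERT INTO central_entity VALUES (?, ?, ?)",
    [(1, "Acme Corp", "US"), (2, "Globex", "DE")],
)
conn.execute("INSERT INTO bu_credit_ext VALUES (?, ?)", (1, "A-"))

rows = conn.execute(
    "SELECT * FROM bu_credit_entity ORDER BY entity_id"
).fetchall()
for row in rows:
    print(row)
```

The shared columns live in one place, so an update to the central table is visible to every BU without re-copying. In practice the central and BU tables would sit on different servers, which makes the join harder than this single-file sketch suggests.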

Where can I find the unique identifiers?

Finding the unique IDs (UIDs) needed to join databases is another issue. Some UIDs are common across multiple businesses, such as an entity ID. A typical use case is joining a team's internal database to the central database on the entity ID. In more complex cases, different databases might have no shared identifiers, and you'll have to find someone who knows which tables to join. (And there are not many such people in a company of 50k employees.) This is not a reliable way to join data. One possible approach is a search algorithm that runs across the different databases and looks for columns with high similarity. The algorithm can search securely by reporting where the data is without offering direct access to it. If a team finds a dataset it can use, either as a bridge to join two datasets or because it is interested in the content itself, it can request access from the owner of that dataset. That way, the security of the data is not compromised.
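As a rough sketch of what such a search could look like, the example below scores columns by the Jaccard similarity of their distinct values and returns only the column location and the score, never the values themselves. The choice of Jaccard similarity, the 0.5 threshold, the catalog structure, and all database, table, and column names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (database, table, column): a location, not the data itself.
ColumnRef = Tuple[str, str, str]


def normalize(values: Iterable[object]) -> Set[str]:
    """Reduce a column to a set of trimmed, lower-cased distinct values."""
    return {str(v).strip().lower() for v in values}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_join_candidates(
    query_values: Iterable[object],
    catalog: Dict[ColumnRef, Iterable[object]],
    threshold: float = 0.5,
) -> List[Tuple[ColumnRef, float]]:
    """Return locations of columns that resemble the query column.

    Only the column reference and its score are returned, so the caller
    learns where a likely join key lives and must still ask the owner
    for access to the data.
    """
    query = normalize(query_values)
    hits = []
    for ref, values in catalog.items():
        score = jaccard(query, normalize(values))
        if score >= threshold:
            hits.append((ref, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical catalog; a real one would be scanned by a trusted service.
catalog = {
    ("credit_db", "counterparty", "ent_id"): ["E001", "E002", "E003", "E004"],
    ("hr_db", "staff", "badge"): ["B17", "B18"],
}
candidates = find_join_candidates(["e001", "E002", "E003"], catalog)
print(candidates)
```

Comparing full value sets does not scale to very large columns; a production version would more likely compare samples or hashed sketches of each column, but the idea of returning locations rather than data stays the same.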

Who can I share data with? Who can I not?

Many of the bigger financial enterprises have businesses on both the buy-side and the sell-side of investment banking. There is an ethical wall between them: the two sides are not supposed to share information, in order to prevent insider trading. When it comes to database sharing, there is a grey area around how much information they can share. It comes down to what regulation determines is OK to share and what is not. Some background information, such as company contacts, can be shared, but other data, such as financial information, is risky when federal regulators examine the firms at year-end. Unfortunately, there is no intelligent way to categorize which data is safe to share and which is not. The safest approach is to create two separate shared databases, one for the buy-side and one for the sell-side, although the trade-off is extra engineering manpower to build the product and double the ongoing maintenance.
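To make the two-database idea concrete, the sketch below builds a separate extract for each side from a manually maintained allowlist of columns. The allowlist, the column names, and the sample row are hypothetical; deciding what belongs on the list is a compliance decision that the code does not attempt to automate.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical allowlist: columns reviewed and approved for crossing the wall.
CROSS_WALL_COLUMNS = {"entity_id", "company_name", "contact_email"}


def build_side_extract(rows: List[Row], owner_side: str, target_side: str) -> List[Row]:
    """Build the copy of a dataset destined for one side's shared database.

    Data staying on its owner's side is copied in full. Data crossing the
    wall keeps only the allowlisted columns.
    """
    if owner_side == target_side:
        return [dict(row) for row in rows]
    return [
        {key: value for key, value in row.items() if key in CROSS_WALL_COLUMNS}
        for row in rows
    ]


# Hypothetical sell-side dataset.
sell_side_rows = [
    {
        "entity_id": 1,
        "company_name": "Acme Corp",
        "contact_email": "ir@acme.example",
        "revenue_forecast": 120.5,
    }
]

sell_db = build_side_extract(sell_side_rows, owner_side="sell", target_side="sell")
buy_db = build_side_extract(sell_side_rows, owner_side="sell", target_side="buy")
print(buy_db)
```

Every dataset is written twice, once per side, which is where the doubled maintenance cost comes from.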