An advanced data ingestion platform to handle massive multi-stream data feeds into the data warehouse of a global financial information and software company
Thomson Reuters is a global financial information and software company serving more than 20 million business people every day. The Thomson Reuters Data Warehouse (TRDW) takes feeds from multiple securities data vendors, scrubs them and delivers the results to customers in the form of market indices and business intelligence. Strategic Systems International (SSI) was for years a development partner for Thomson Reuters.
Thomson Reuters turned to SSI for a new data ingestion system to handle input from its many data vendors to the TRDW. Each vendor implementation would require the creation of custom applications. Data arrived in a wide array of formats, such as Excel, .txt and .csv, as well as custom formats. Uploads for some vendors were infrequent, while others updated every few minutes. A number of vendors had legacy problems with uploaded data, including anomalies such as duplicate records and null or empty values. In some cases uploads amounted to multiple terabytes of data needing to be processed and transformed in a matter of minutes, while maintaining data quality and accuracy. Business rules needed to be created and applied. Throughout, coding optimization had to be a priority.
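To make the data quality problem concrete, the following is a minimal sketch of how duplicate records and null or empty values might be screened out of a delimited vendor file. It is not SSI's implementation. The column names (`security_id`, `date`, `price`), the list of null tokens and the function names are all assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical column names; a real vendor feed would define its own.
REQUIRED_FIELDS = ("security_id", "date", "price")
# Assumed spellings that vendors might use for "no value".
NULL_TOKENS = {"", "null", "none", "n/a", "nan"}


def is_missing(value):
    """Return True when a field is absent, empty or a known null token."""
    return value is None or value.strip().lower() in NULL_TOKENS


def read_delimited(text, delimiter=","):
    """Parse delimited text (.csv, or .txt with another delimiter) into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def clean_records(rows, key_fields=("security_id", "date")):
    """Split rows into clean records and rejected records with a reason.

    A row is rejected when a required field is missing, or when an earlier
    row already used the same key (a duplicate record).
    """
    seen = set()
    clean, rejected = [], []
    for row in rows:
        if any(is_missing(row.get(field)) for field in REQUIRED_FIELDS):
            rejected.append((row, "missing value"))
            continue
        key = tuple(row[field].strip() for field in key_fields)
        if key in seen:
            rejected.append((row, "duplicate"))
            continue
        seen.add(key)
        clean.append(row)
    return clean, rejected


# Made-up sample data for illustration only.
SAMPLE = (
    "security_id,date,price\n"
    "A1,2020-01-02,10.5\n"
    "A1,2020-01-02,10.5\n"
    "B2,2020-01-02,\n"
    "C3,2020-01-02,NULL\n"
    "D4,2020-01-03,7.25\n"
)

clean, rejected = clean_records(read_delimited(SAMPLE))
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows together with a reason, rather than silently dropping them, is one way to support the later quality assurance step, since anomalies can be reported back per vendor.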
The data ingestion project was not simply a matter of applying state-of-the-art technology to handle a massive and multi-stream flow of data. A new model for data ingestion was created, along with new business rules for the entire extract, transform and load (ETL) process. In the final phase of ingestion and post-processing, complex business logic formulas had to be developed and applied, for example to create indices. Quality assurance involved verifying business logic and validating ingestion results. The update and post-processing phase converted the data into an update file that was packaged and licensed, after which it was ready to be published for customers. SSI also developed a communicator application that Thomson Reuters’ customers use to download files and insert them into their own databases.
The data processing phase was highly optimized and met Thomson Reuters’ performance and quality expectations. The experience with data ingestion from one vendor, MSCI Barra, illustrates the success of the SSI engagement. MSCI is a leading source of global index data in the equity and real estate investment trust arenas. The MSCI project required ingesting historical files from as far back as 1969 through the present day. The timelines for the MSCI project were aggressive, as Thomson Reuters had to meet stringent business commitments. The first phase took two to three months and the project as a whole took a year, and the timelines were met.