In the era of big data, where information serves as the lifeblood of digital interconnectedness, the ability to seamlessly harness and analyze data is paramount. Data, with its transformative potential, connects the world, enables deep analysis, and underpins critical decision-making processes. At Databox, our mission is to empower businesses to enhance their performance through robust data analytics. As we navigate the dynamic landscape of modern business intelligence, our focus extends to integrating with over 100 Software as a Service (SaaS) APIs. In this blog post, we invite you to explore the intricacies of our journey, revealing the methods and challenges we encounter while working with diverse APIs to retrieve user data and construct a product that thrives on the power of information. Let’s delve into the crucial convergence of data, technology, and the quest for excellence in business analytics.
At Databox, we integrate with a multitude of external APIs, which serve as the primary sources of data for either (1) preparing predefined metrics for our users or (2) enabling them to create their own metrics using our Metric Builder feature.
Each API integration within the Databox codebase comprises a manifest file, a metrics file, and the integration’s logic. The manifest file includes various sections that describe, among other things, the integration’s metadata, the connect flow for creating data sources, and the API endpoints for data retrieval. The metrics file provides information on all specific metrics the integration offers, including their metadata, limitations, and rules for fetching the underlying data. The integration logic is responsible for request preparation, request signing with credentials, pagination, and data preparation from API responses.
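As an illustration, a simplified and purely hypothetical manifest might look like the sketch below; the field names and structure are invented for this post and do not reflect our internal format.

```python
# Hypothetical, simplified manifest for an imaginary "exampleapp" integration.
# Field names and structure are illustrative only.
EXAMPLEAPP_MANIFEST = {
    "metadata": {
        "name": "exampleapp",
        "title": "Example App",
        "auth_type": "oauth2",
    },
    "connect_flow": [
        {"step": "authorization", "scopes": ["reports.read"]},
        {"step": "selection", "resource": "project", "endpoint": "list_projects"},
        {"step": "success", "metadata_endpoints": ["account_settings"]},
    ],
    "endpoints": {
        "list_projects": {"method": "GET", "path": "/v1/projects"},
        "account_settings": {"method": "GET", "path": "/v1/settings"},
        "daily_report": {
            "method": "GET",
            "path": "/v1/projects/{project_id}/reports/daily",
            "query": ["start_date", "end_date", "metrics"],
            "pagination": "cursor",
        },
    },
}
```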
The integrations’ manifests and logic are used directly by the data platform services. Those services orchestrate each integration’s connect flow, overseeing the creation of connections and the underlying data sources. Once a data source is created and a metric is used within the Databox app, the data platform orchestrates the preparation, signing, and execution of requests against external APIs, as well as the integration-specific data preparation.
We integrate with numerous external REST APIs and databases. The data platform makes requests directly to REST APIs for the majority of Databox integrations, including Google Analytics 4, HubSpot, and Shopify. Because some REST APIs and databases have specific requirements for requests or data extraction, the data platform is enhanced with dedicated proxy services, which simplify and generalize integrations with external APIs such as the Google Drive API or the BigQuery API.
Despite our best efforts, the rigidity of API standards and the prevailing notion that REST APIs are simple present challenges when dealing with over 100 sources. Non-compliance with standards, diverse authentication methods, unexpected data formats, and functionality limitations all pose hurdles. To tackle these challenges, we’ve implemented a system that enables us to overcome issues with a minimal custom approach. The following sections outline some of our best practices for integrating with external APIs and shed light on the challenges we encounter.
Connecting data sources for each integration follows the connect flow detailed in the integration’s manifest. The connect flow enables us to obtain users’ authorization for accessing their data and comprises three distinct steps: authorization, selection, and success. During the authorization step, we acquire the user’s consent to access their data. In the selection step, the user chooses a specific project from among the multiple projects they have access to. In the success step, we retrieve all necessary metadata about the user’s data source, such as currency, timezone, or other pertinent information, ensuring an optimal experience with their data within our application.
The connect flow invariably begins with the mandatory authorization step and can be followed by any combination of the remaining steps. Our integrations support either OAuth 2.0 or OAuth 1.0 authorization, facilitating the acquisition and subsequent refreshing of user-specific access tokens. Alternatively, users may securely share plain credentials, such as API keys or usernames and passwords, in connect flows for REST API or database integrations that support simpler authorization.
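For the OAuth 2.0 case, the authorization step boils down to the standard authorization-code exchange. The sketch below is a minimal illustration using the requests library; the URLs, client credentials, and variable names are placeholders, not our actual implementation.

```python
import requests

# Placeholder values; a real integration reads these from its manifest and secret storage.
TOKEN_URL = "https://api.example.com/oauth2/token"
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"
REDIRECT_URI = "https://app.example.com/oauth/callback"


def exchange_authorization_code(code: str) -> dict:
    """Exchange the authorization code returned by the provider for access and refresh tokens."""
    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": code,
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "redirect_uri": REDIRECT_URI,
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # access_token, refresh_token, and expires_in are defined by the OAuth 2.0 standard,
    # although, as discussed below, not every provider follows it to the letter.
    return {
        "access_token": payload["access_token"],
        "refresh_token": payload.get("refresh_token"),
        "expires_in": payload.get("expires_in"),
    }
```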
We implement various measures to manage sensitive user information securely, starting with the storage of encrypted credentials and extending to the logging and monitoring of all requests executed on users’ behalf.
The most crucial aspect of working with external APIs and users’ data is security. Users must securely grant us access to their data on one side, while, on the other side, we must ensure that the credentials we possess are safely managed, stored, and utilized.
The drawback of simple authorization methods, such as API keys and username-password combinations, lies in their security limitations. Typically, such credentials provide long-lived and overly broad access to API resources due to a lack of granularity in scopes and permissions. If they are compromised in any way, an attacker gains access to all resources associated with those credentials. Despite these limitations, approximately 40% of our integrations still rely on these authorization methods because the APIs don’t offer more secure alternatives.
While the OAuth 1.0 authorization method offers a more secure approach with its signature mechanism, OAuth 2.0 provides a balance between security and simplicity, making it more suitable for many modern use cases. We use OAuth 1.0 authorization in only two of our integrations, and its complexity always poses a challenge when improving our code base.
OAuth 2.0 offers advantages over other authorization methods, but it’s crucial to implement it correctly on both the API provider and integration sides. Despite the clear definition of the OAuth 2.0 standard, many API providers do not follow it entirely, so we continuously face non-standard variations that demand accommodating deviations from the standard, resulting in a more complex and less maintainable code base. The main deviations include (1) extra parameters needed in the authorization flow, (2) non-standard names for well-defined parameters of the standard, (3) non-informative authorization-related errors and errors with invalid status codes, and (4) long-lived or forever-valid access tokens. Additionally, API providers do not always adhere to the principle of least privilege and do not let users grant applications only the minimum permissions necessary, which poses a security concern because we often have to request broader permissions than data retrieval requires.
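Returning to the deviations above: one way to accommodate them without branching the whole authorization flow is to describe them declaratively per integration and apply them in a single place. The sketch below is a hypothetical illustration of that idea, not our actual code; the integration name, override keys, and parameters are invented.

```python
# Hypothetical per-integration overrides for non-standard OAuth 2.0 token endpoints.
TOKEN_REQUEST_OVERRIDES = {
    # This imaginary provider expects "api_key" instead of the standard "client_id"
    # and requires an extra "account_type" parameter.
    "quirky_crm": {
        "rename": {"client_id": "api_key"},
        "extra": {"account_type": "business"},
    },
}


def build_token_request(integration: str, params: dict) -> dict:
    """Apply integration-specific deviations to an otherwise standard token request."""
    overrides = TOKEN_REQUEST_OVERRIDES.get(integration, {})
    renamed = {overrides.get("rename", {}).get(key, key): value for key, value in params.items()}
    renamed.update(overrides.get("extra", {}))
    return renamed


# The standard parameters are transformed only for integrations that need it.
standard = {"grant_type": "authorization_code", "code": "abc", "client_id": "id-123"}
print(build_token_request("quirky_crm", standard))
# {'grant_type': 'authorization_code', 'code': 'abc', 'api_key': 'id-123', 'account_type': 'business'}
```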
Despite metadata being practically as important as the data itself when interpreting the data, there are instances where we are unable to obtain valid or standardized metadata from APIs.
We have faced situations where we couldn’t get a unique identifier for the connected data source from the API, even though the unique identifier is the most crucial metadata for unambiguous identification. Moreover, there are cases where APIs return non-standard time zones or where integrated applications support multiple currencies but expose only a single currency as metadata on the data source resource through their API, not returning currency alongside the values in each API response.
Upon receiving triggers to synchronize users’ data for their selected metrics, we first prepare the requests to be executed against external APIs. Our integration-specific logic then merges multiple requests where possible, especially those with overlapping date ranges. This optimization significantly reduces the overall number of requests executed against APIs, making the synchronization process more efficient.
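The date-range merging itself is a classic interval-merge problem. The following is a minimal sketch of the idea, assuming two requests differ only in their date range; our actual merging logic considers more request attributes than this.

```python
from datetime import date


def merge_date_ranges(ranges: list[tuple[date, date]]) -> list[tuple[date, date]]:
    """Merge overlapping (start, end) date ranges into the fewest possible ranges."""
    merged: list[tuple[date, date]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of issuing a new request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three metric syncs with overlapping ranges collapse into a single API request.
print(merge_date_ranges([
    (date(2024, 1, 1), date(2024, 1, 31)),
    (date(2024, 1, 15), date(2024, 2, 15)),
    (date(2024, 2, 10), date(2024, 2, 28)),
]))
# [(datetime.date(2024, 1, 1), datetime.date(2024, 2, 28))]
```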
The logic for preparing requests relies on our generalized approach to describing APIs. Each integration’s API description outlines operations containing crucial information for the HTTP request, such as the HTTP method, endpoint, and other request parameters—whether dynamic parameters inside the endpoint, query parameters, headers, or request bodies.
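In practice, a request can be assembled generically from such an operation description. The sketch below illustrates the idea with invented field names, in the spirit of the hypothetical manifest shown earlier; it is not our actual data platform code.

```python
import requests

# Hypothetical operation description with placeholder endpoint and parameter names.
OPERATION = {
    "method": "GET",
    "base_url": "https://api.example.com",
    "path": "/v1/projects/{project_id}/reports/daily",
    "query": ["start_date", "end_date", "metrics"],
}


def prepare_request(operation: dict, params: dict, access_token: str) -> requests.Request:
    """Build an HTTP request from a generic operation description and resolved parameters."""
    url = operation["base_url"] + operation["path"].format(**params)
    query = {name: params[name] for name in operation.get("query", []) if name in params}
    return requests.Request(
        method=operation["method"],
        url=url,
        params=query,
        headers={"Authorization": f"Bearer {access_token}"},
    )


request = prepare_request(
    OPERATION,
    {"project_id": "42", "start_date": "2024-01-01", "end_date": "2024-01-31", "metrics": "sessions"},
    access_token="token-from-the-connect-flow",
)
print(request.prepare().url)
```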
We support one level of dependency between endpoints, which means we can make additional requests before the main request to gather all the parameters needed to prepare it. This gives us a structured and manageable way to handle at least basic dependencies within API requests.
Pagination support is a key API feature we harness to navigate multiple pages of data and thus retrieve complete datasets for metric preparation. The modularity of our code base allows integration engineers to accommodate various pagination types by programmatically supporting the preparation of pagination requests for each integration individually, ensuring adaptability to different API structures.
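As a simple illustration, cursor-based pagination handling might look like the sketch below. The endpoint, parameter, and field names are placeholders; each real integration implements its own pagination variant, and a hard page cap guards against the kind of broken pagination metadata described later in this post.

```python
import requests


def fetch_all_pages(url: str, headers: dict, page_size: int = 100, max_pages: int = 1000) -> list[dict]:
    """Follow a hypothetical cursor-based pagination scheme until the cursor is exhausted."""
    records: list[dict] = []
    cursor = None
    for _ in range(max_pages):  # hard cap protects against corrupted pagination metadata
        params = {"limit": page_size}
        if cursor:
            params["cursor"] = cursor
        response = requests.get(url, headers=headers, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload.get("data", []))
        cursor = payload.get("next_cursor")
        if not cursor:
            break
    return records
```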
Once the requests are prepared, we sign them with credentials and execute them. Automation plays a pivotal role in refreshing short-lived credentials, such as OAuth 2.0 access tokens, as our dedicated workers handle this task for active data sources. We dynamically stop refreshing access tokens for data sources that are not in use, optimizing resource allocation by refreshing tokens only when necessary. In cases where outdated tokens are identified during data synchronization, we refresh them with on-demand requests. This ensures timely token renewal based on specific synchronization needs.
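The on-demand part of that flow can be as simple as checking token expiry right before signing a request and refreshing only when needed. Below is a minimal sketch with placeholder names, assuming a standard OAuth 2.0 refresh-token grant.

```python
import time

import requests

TOKEN_URL = "https://api.example.com/oauth2/token"  # placeholder


def get_valid_access_token(credentials: dict, client_id: str, client_secret: str) -> str:
    """Return a usable access token, refreshing it on demand if it is (about to be) expired."""
    # Refresh slightly before actual expiry to avoid racing the deadline mid-request.
    if credentials["expires_at"] - time.time() > 60:
        return credentials["access_token"]

    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "refresh_token",
            "refresh_token": credentials["refresh_token"],
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    credentials["access_token"] = payload["access_token"]
    credentials["expires_at"] = time.time() + payload.get("expires_in", 3600)
    # Some providers rotate refresh tokens on every refresh as well.
    credentials["refresh_token"] = payload.get("refresh_token", credentials["refresh_token"])
    return credentials["access_token"]
```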
By adopting these best practices, we maintain a robust, consistent, and adaptable approach to preparing API requests, emphasizing simplicity, efficiency, and security across all integrations.
One of the challenges we face when working with APIs involves intricate dependencies between endpoints, necessitating the execution of multiple additional requests before we can prepare and execute the main request. This challenge is compounded by the frequent absence of analytics endpoints, which could expose more aggregated data and reduce the need for extensive pagination to obtain complete data sets.
Retrieving large data sets through APIs poses another challenge due to non-uniform, missing, or highly limited filtering options when requesting data. The inability to request only the fields we need results in responses that are larger and more complex than necessary, often leading to extensive pagination for data retrieval. Additionally, APIs often lack options to request compressed responses.
A single API may expose endpoints implementing different pagination types, increasing complexity in pagination handling. In numerous cases, incomplete pagination metadata in responses is encountered, such as paginated responses lacking crucial metadata on the total number of records. We have even discovered cases of corrupted metadata in responses, leading to infinite cycling on one of the API’s endpoints with both supported pagination types. Despite reporting this bug to the API provider, they explicitly confirmed it but, unfortunately, closed the case without implementing a fix.
While some APIs offer endpoints for asynchronous data retrieval, this practice is not yet widespread. Asynchronously requesting data and polling for reports can be a best practice, especially when dealing with large data sets or when the preparation of data on the API side is time-consuming.
As an extension of the optimizations in the request preparation step aimed at reducing the number of executed requests, we also apply practices during the execution step to minimize the number of requests made against APIs. Additionally, we’ve established a rate-limit handling system that enables us to efficiently utilize the entire quota of requests per time unit when APIs are rate-limited.
All API responses obtained during request execution are compressed and stored in a short-lived cache. Upon receiving an identical request, we check the cache. If an API response is already available, we skip the execution of prepared requests, streamlining the synchronization flow and simultaneously reducing the load on our data platform.
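Conceptually, the cache key is derived from the prepared request and the value is the compressed response body. The sketch below illustrates the idea with an in-memory dictionary; in reality such a cache would live in a short-lived shared store with an explicit TTL, and the key derivation shown here is only an assumption for the example.

```python
import hashlib
import json
import zlib

# Illustration only: an in-memory stand-in for a short-lived shared cache.
_RESPONSE_CACHE: dict[str, bytes] = {}


def cache_key(method: str, url: str, params: dict) -> str:
    """Derive a deterministic key from the parts of the request that define its result."""
    raw = json.dumps({"method": method, "url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def get_cached_response(key: str) -> dict | None:
    """Return the decompressed cached response, or None on a cache miss."""
    compressed = _RESPONSE_CACHE.get(key)
    if compressed is None:
        return None
    return json.loads(zlib.decompress(compressed))


def store_response(key: str, payload: dict) -> None:
    """Compress and store an API response for identical follow-up requests."""
    _RESPONSE_CACHE[key] = zlib.compress(json.dumps(payload).encode())
```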
Given the common occurrence of rate limiting in APIs, we’ve developed our own system for handling rate limiting. Initially introduced for the Google Universal Analytics integration, it has since been applied to other integrations. Although initially effective with request counting, comparison to defined quota limits, and delay calculation for request execution, we have recently enhanced our system for handling rate limits. It now possesses the capability to dynamically respond to the remaining tokens received in API responses, accommodating new quota limits handling on Google Analytics 4. As our custom rate-limiting handling system is a fascinating topic in its own right, one of our upcoming blog posts will delve into its intricacies.
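To give a flavour of the dynamic part, the sketch below throttles requests based on a hypothetical remaining-quota response header. The header names and thresholds are illustrative; our actual system also counts requests against known quota limits per data source.

```python
import time

import requests


def execute_with_rate_limit(url: str, headers: dict, params: dict) -> dict:
    """Execute a request and back off dynamically based on the remaining-quota header."""
    while True:
        response = requests.get(url, headers=headers, params=params, timeout=30)

        if response.status_code == 429:
            # Quota exhausted: honour Retry-After if present, otherwise wait a default interval.
            time.sleep(int(response.headers.get("Retry-After", 60)))
            continue

        response.raise_for_status()

        # Hypothetical header exposing the remaining request tokens in the current window.
        remaining = int(response.headers.get("X-Quota-Remaining", 100))
        if remaining < 10:
            # Getting close to the limit: slow down proactively instead of hitting 429s.
            time.sleep(1.0)
        return response.json()
```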
The primary challenge in executing requests lies in their sheer volume. To update users’ data promptly, if not in real time, we must handle millions of requests daily. This necessitates an efficient and high-performing data platform capable of managing such a volume and dynamically responding to spikes. It also requires a team of highly skilled engineers who can proactively identify and eliminate potential bottlenecks and stay a step or two ahead of new product features that demand more syncing, as well as of growing sign-ups.
APIs frequently lack support for asynchronous operations and polling. Asynchronous operations involve requesting data without an immediate response, allowing the client to periodically check for task completion. This approach is beneficial for large datasets or when data preparation on the API side is time-consuming. While we do support asynchronous operations and polling in some integrations, this retrieval method introduces several new and complex challenges. These include polling overhead, strategies to handle rate-limiting restrictions (such as adjusting polling frequency or implementing backoff mechanisms), concurrency and parallel processing, handling partial results, long-term operation considerations (like connection stability and client uptime), and the need for resumable operations in case of interruptions or failures. Effective monitoring and logging mechanisms are also crucial for tracking the progress of asynchronous operations, identifying issues, and gaining insights into performance bottlenecks or anomalies.
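Where an API does offer asynchronous report generation, the retrieval loop typically looks like the hedged sketch below: submit the job, then poll with exponential backoff until it completes or a deadline is reached. Endpoint paths and response field names are placeholders.

```python
import time

import requests


def fetch_async_report(base_url: str, headers: dict, report_request: dict,
                       max_wait_seconds: int = 600) -> dict:
    """Submit a report job, then poll its status with exponential backoff until it is ready."""
    # Step 1: ask the API to start preparing the report (no data in the response yet).
    submit = requests.post(f"{base_url}/reports", json=report_request, headers=headers, timeout=30)
    submit.raise_for_status()
    report_id = submit.json()["report_id"]

    # Step 2: poll for completion, backing off to stay well within rate limits.
    delay, waited = 2.0, 0.0
    while waited < max_wait_seconds:
        time.sleep(delay)
        waited += delay
        status = requests.get(f"{base_url}/reports/{report_id}", headers=headers, timeout=30)
        status.raise_for_status()
        payload = status.json()
        if payload["status"] == "completed":
            return payload["result"]
        if payload["status"] == "failed":
            raise RuntimeError(f"Report {report_id} failed on the API side")
        delay = min(delay * 2, 60.0)  # exponential backoff with an upper bound

    raise TimeoutError(f"Report {report_id} was not ready after {max_wait_seconds} seconds")
```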
After successfully retrieving API responses, we process and save them to our file server. Processed responses are directed to the data preparation logic, implemented separately for each API integration. At its core, the data preparation logic extracts metric values, timestamps, and dimensions for each supported metric and forwards a predetermined data structure back to the data platform. The data platform manages data storage in the warehouse, completing the syncing flow. Our goal is for the data in our application to consistently match what users see in the connected platforms. Achieving this demands careful engineering and, at times, even seemingly magical skills. While our integrations have multiple components and serve as the core of Databox, the data preparation logic is in turn the core of each API integration and is undoubtedly the most challenging part to get right: it must perfectly support all date ranges, granularities, and visualizations.
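At its simplest, that extraction step maps a provider-specific response onto an internal structure of values, timestamps, and dimensions. The sketch below is a deliberately simplified, hypothetical example of such a mapping for a single metric; the response shape and the output fields are invented for illustration.

```python
from datetime import datetime

# Hypothetical raw API response for a "sessions by device" report.
api_response = {
    "rows": [
        {"date": "2024-01-01", "device": "desktop", "sessions": 1240},
        {"date": "2024-01-01", "device": "mobile", "sessions": 860},
    ]
}


def prepare_sessions_metric(response: dict) -> list[dict]:
    """Transform a provider-specific payload into a uniform structure of values, timestamps, and dimensions."""
    prepared = []
    for row in response.get("rows", []):
        prepared.append({
            "metric": "sessions",
            "value": row["sessions"],
            "timestamp": datetime.strptime(row["date"], "%Y-%m-%d"),
            "dimension": row["device"],
        })
    return prepared


print(prepare_sessions_metric(api_response))
```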
To handle errors from APIs, we have a messaging layer in place. This layer allows us to extract crucial information from errored API responses and transparently communicate them to users. Errors are categorized as request, connection, or selection errors. Connection errors necessitate the reconnection of the data source, selection errors prompt changes in metric setup, and request errors indicate currently unsuccessful data retrieval, which will be retried during the next scheduled sync.
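A simplified version of that classification logic is sketched below; the mapping rules and the error-payload check are illustrative and differ per integration in practice.

```python
# Illustrative error categories matching those described above.
CONNECTION_ERROR = "connection"  # user must reconnect the data source
SELECTION_ERROR = "selection"    # user must adjust the metric setup
REQUEST_ERROR = "request"        # transient; retried on the next scheduled sync


def categorize_api_error(status_code: int, error_payload: dict) -> str:
    """Map an errored API response to a category that drives the user-facing message."""
    # Some providers signal credential problems in the body rather than the status code.
    if status_code in (401, 403) or error_payload.get("error") == "invalid_grant":
        return CONNECTION_ERROR  # expired or revoked credentials
    if status_code in (400, 404, 422):
        return SELECTION_ERROR   # the requested resource or parameters no longer exist
    # Everything else (429, 5xx, timeouts, unknown payloads) is retried later.
    return REQUEST_ERROR
```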
The key challenge in deriving valuable metrics from API responses for some integrations is navigating complex dependencies between data, compounded by the absence of analytical API endpoints that provide processed values aligning with what users track on the connected platforms within Databox. While deciphering intricate relationships between data and preparing matching metrics is feasible in some cases, complexity escalates when integrating with inherently complex platforms whose myriad setup possibilities result in non-deterministic metric values that vary from user to user.
Often, data preparation challenges extend to handling history changes on the API side, understanding data manipulation, and addressing instances where raw data may be corrupted, either unintentionally or through API providers allowing such corruption. For instance, when an entity goes through multiple statuses, some providers may update only a single timestamp instead of adding timestamps for each update, hindering the availability of a comprehensive history. Exposure to anomalies or illogical data processing invariably raises questions, such as “What constitutes raw data?” or “What defines data manipulation?” It prompts considerations like “Is it acceptable to delete or update an event that occurred but is no longer desired?” and delves into ethical inquiries about the appropriateness of such event deletion or update.
Additional hurdles include missing metadata in API responses (such as currency and timezone information), suboptimal error handling that does not ensure meaningful error responses, and API docs that may not define all possible errors. Some APIs even return errors with a 200 OK status, complicating the interpretation of something as simple as the response status code. Another aspect is the absence of response compression, which requires careful consideration for efficient data transfer. These challenges underscore the complexities of seamless and accurate data integration from diverse APIs.
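For the 200 OK case specifically, the response body has to be inspected before trusting the status code. A minimal sketch of that guard, assuming a hypothetical error envelope, might look like this:

```python
import requests


def parse_or_raise(response: requests.Response) -> dict:
    """Treat a response as an error if the body says so, even when the status code is 200 OK."""
    response.raise_for_status()
    payload = response.json()
    # Hypothetical error envelope: some APIs wrap failures in an otherwise successful response.
    if isinstance(payload, dict) and payload.get("status") == "error":
        raise RuntimeError(f"API reported an error despite 200 OK: {payload.get('message')}")
    return payload
```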
Our general challenges in working with third-party APIs relate to communication with API providers, maintaining our integrations concerning continuous API changes and deprecations, access to sandbox or testing accounts, and the complexities of scalability testing due to the inherent dependency on external APIs.
When APIs function as described in their publicly available documentation, return accurate data, and their providers communicate all changes exhaustively and in a timely manner, there is no reason to establish additional communication channels with API providers. However, when we encounter issues and confirm that they originate from the API, effective communication and good relationships with API providers become of utmost importance to us and our users.
Based on our experience, most API providers handle issue reports with reasonable and expected priority. Although the timely resolution of reported issues or response to our inquiries is not always possible, API providers eventually fix things or share information that allows us to resolve issues for our mutual users. However, throughout the years of working with numerous APIs, we have also encountered cases where evident and widespread API issues we reported were not taken into account and API providers were not willing to fix them.
After deciding to integrate with a specific platform, our first step is to subscribe to the API’s changelogs and other relevant channels. This ensures that we always have the latest information on updates, backward compatibility, deprecations, and sunsets. Generally, we receive crucial information on time. However, there are instances when we miss important updates because API providers did not share them or because we were not aware of the appropriate channels to subscribe to. Consequently, we must reactively resolve critical issues that catch us unprepared. These issues range from innocent cases like a single non-working endpoint to more severe scenarios, such as failing access-token refreshes or even entire APIs being deprecated almost overnight.
The continuous testing of our API integrations, validation of data accuracy, and load testing pose significant challenges when collaborating with third-party API providers. At the same time, these challenges present opportunities for innovation. A prerequisite for development is access to a valid and operational test account, whether one generously shared by our users, our own account, or an account provided by the API providers themselves. Given the myriad setup options and features of modern platforms, which result in intricate dependencies and non-deterministic metric values that differ between users, the task of covering all use cases and ensuring thorough testing and data validation becomes even more complex.
To date, we have not encountered an API provider offering foolproof testing accounts or sandboxes that authentically mirror real-world usage, facilitating seamless integration with their platform. It would be immensely beneficial if API providers included mock APIs as part of their development kit. This would enable us to execute their APIs with mock data on our end, utilizing them for end-to-end tests, as well as for conducting load and stress tests on our platform, all without the necessity of making calls to actual APIs.
In our vision, an ideal API is one that elevates its status beyond being a mere byproduct, recognizing it as a core product in the contemporary data-driven and digitally interconnected landscape. Such an API would adhere to industry standards and prioritize the development process with a relentless pursuit of quality. Its state-of-the-art documentation would start with OAuth 2.0 authorization and finish with a description of the endpoints offering reasonably aggregated data on one side and raw data on the other. To ensure a seamless user experience, it would boast dedicated support and engineering teams committed to addressing API-related challenges promptly and effectively. Communication would be a cornerstone, with advanced notifications for API changes and deprecations delivered through a unified channel. Additionally, the API provider would establish an open line of communication for issue reporting and queries. Going a step further, the provision of sandbox or test accounts mirroring the diverse data relationships within their platforms would be a pivotal feature. Importantly, the API would enforce best practices in data handling, prohibiting data manipulation and fostering a secure and trustworthy environment for users.
While our current API integrations rely on on-demand REST API calls and fetching datasets from SQL data sources, we are actively developing more advanced data preparation pipelines. A key pipeline in the works is designed to facilitate complete replication and continuous updates of data from sources, coupled with the dynamic creation of metrics on top of this replicated data. This strategic shift addresses multiple challenges inherent in our current pipeline, notably the high volume of API calls, recurrent breaches of rate limits, and the volatility of data on the API side.
Anticipating that new solutions inevitably introduce their own set of challenges, we are confident that our ongoing efforts to mitigate and eliminate existing obstacles will pave the way for the generation of even more reliable data for our users. This marks a significant stride toward achieving operational excellence in our data preparation processes. Stay tuned for all the groundbreaking developments coming your way.
Breaking down the challenges that come with the integration of over 100 SaaS APIs is part of a series of technical articles that offer a look into the inner workings of our technology, architecture, and product & engineering processes. The authors of these articles are our product or engineering leaders, architects, and other senior members of our team who are sharing their thoughts, ideas, challenges, or other innovative approaches we’ve taken to constantly deliver more value to our customers through our products.
Uroš Trstenjak, PhD, is an Engineering Manager in the Data Engineering team, focusing on implementing scalable data infrastructure. At Databox, Uroš is involved in strategizing and implementing solutions to optimize data-related processes, contributing to the overall success and effectiveness of the Data Engineering team.
Stay tuned for the latest technical insights and cutting-edge ideas in data engineering.