At Databox, our mission is to help growing businesses leverage their data to make better decisions and improve their performance. We envision a future where every company, no matter the size, can harness its existing data to create more accurate marketing plans, sales goals, budget planning, and more.

Last year, we started tracking exciting breakthroughs in data analysis, AI, and machine learning that can help us make advanced analysis available to every business. To help us apply these new technologies to our customers’ needs, we formed a dedicated Data Science team. By staying on the lookout for the latest trends and innovative ways to integrate advanced features into the product, the team advances our goal of delivering cutting-edge features to our users and elevating their overall experience.

So far, this has resulted in our newly released Databox Analytics platform, which includes the Performance Summary feature. The feature summarizes the performance of our clients’ key metrics, saving them the time and effort they previously spent manually sifting through large amounts of data. Since summarizing an extensive dataset is challenging due to its complexity, we turned to generative models for support and faced a pivotal decision: GPT-3.5 or GPT-4.

This blog post compares GPT-3.5 and GPT-4 and contrasts their capabilities through the lens of our newly released Performance Summary feature.

ChatGPT 101: A simple breakdown

By now, most people are familiar with ChatGPT. In simple terms, it is a language model that understands and generates human-like text based on the input it receives. To be useful, the model had to be trained on vast amounts of data. ChatGPT was trained on a diverse set of datasets, including Common Crawl, WebText2, Books1, Books2, and Wikipedia. This diverse training set is a big reason behind the general-purpose nature of the tool.
GPTs, or Generative Pretrained Transformers, operate on ‘tokens’, the fundamental units of text, using them to predict and create coherent responses. During the training phase, the model learns statistical relationships between these tokens, allowing it to generate contextually appropriate text. When a user sends a prompt, the input is first broken into tokens, a process called tokenization. For example, the sentence “Databox is amazing!” is tokenized into [D, atab, ox, is, amazing, !]. The tokens are then processed by the model’s neural network, which uses learned patterns to produce a fitting and coherent response as a sequence of tokens.

OpenAI, the company behind ChatGPT, provides a chat interface that simplifies access to its AI models, featuring GPT-3.5 and the latest addition, GPT-4. Both models have taken the world by storm by supporting functionalities such as:
- enhancing chatbot performance,
- sentiment classification,
- summarization of diverse text types,
- assisting in software development tasks like explanation, generation, verification, and transpilation,
- simplifying content by creating outlines and extracting key information,
- augmenting and creating content,
- providing proactive suggestions based on context.

Aside from the chat interface, OpenAI exposes an Application Programming Interface (API) for accessing its models. The API is simple to use while maintaining the flexibility required for advanced use cases. It allows teams of all sizes to focus on research and development rather than on distributed systems problems. Many models are exposed via the API, the most important being an optimized variant of GPT-3.5 called GPT-3.5-Turbo and the newest addition to the OpenAI portfolio, GPT-4.

Picking the Perfect Fit: Model Selection for Databox Integration

When enhancing our features with ChatGPT, deciding which model to integrate was about tradeoffs.
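Before comparing the models, a brief aside on tokens may help, since pricing, rate limits, and response times are all counted in them. The sketch below reproduces the “Databox is amazing!” split from earlier using a greedy longest-match lookup over a tiny, hypothetical vocabulary. This is a simplification for illustration only: GPT models use byte-pair encoding with a vocabulary learned from data, and the `tokenize` function and `VOCAB` set are assumptions, not part of any OpenAI library.

```python
from typing import List, Set


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenization over a fixed vocabulary.

    Illustrative only: real GPT tokenizers use byte-pair encoding.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so that text outside the vocabulary is never dropped.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary chosen to match the example in the text.
VOCAB = {"D", "atab", "ox", " is", " amazing", "!"}

if __name__ == "__main__":
    print(tokenize("Databox is amazing!", VOCAB))
```

Note that the whitespace travels with the following word (" is", " amazing"), which is why joining the tokens back together recovers the original sentence exactly.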
To better understand the characteristics and limitations of each model, our first step was comparing some of their differences, as found in the API documentation and the GPT-4 technical report.

The comparison above suggests that GPT-4 is supported by a significantly larger architecture. As a consequence, GPT-4 does better on the HumanEval benchmark dataset and achieves comparatively better results in exams across a broad spectrum of fields such as Medicine, Computer Science, Science, Math, and Law. This significant qualitative difference is a direct consequence of the larger model size. However, processing time and costs increase significantly as well, which is reflected in the higher pricing and more stringent API restrictions.

The security of the data our users entrust us with is our top priority. This is even more important when we send user data to a third party. As of the time of this writing, OpenAI has strong policies in place to ensure user data remains secure and private for both GPT-3.5-Turbo and GPT-4. Data is encrypted at rest using AES-256 and in transit using TLS 1.2+. OpenAI does not train models on user data, and it has been audited for SOC2 compliance. See the OpenAI Security Portal for more information on data security, privacy, and compliance on the OpenAI API platform.

OpenAI determines rate limits according to usage tiers. The higher the usage tier for an organization, the higher the rate limits. Our current limits for GPT-3.5-Turbo and GPT-4 are displayed in the table below.

Given these limits, a sensible default is to use GPT-3.5-Turbo wherever possible and to reach for GPT-4 only once all other options are exhausted.

Using ChatGPT for Performance Summaries

Most of our clients use Databox for aggregating and visualizing their data and benefit greatly from quick data interpretation. However, having all their data in one place can be overwhelming for clients with a large amount of it.
The dilemma shifts from needing more data to being inundated with too much of it, which is why we have created the Performance Summary feature. It helps turn a wealth of data into actionable insights, saving our clients time and effort. Instead of sifting through tons of metrics, users get a concise summary of their performance.

The Anatomy of Performance Summary

Figure: Performance Summary feature

To give the user a snapshot of how well a specific aspect of their business is doing, the Performance Summary consists of several components. The Generative AI-supported elements are:
- Performance metric summary: offers a descriptive and inferential overview of the data. It discusses the highlights and lowlights of individual metrics and their causes, and goes beyond mere numbers by attempting to discern the deeper meaning of each metric, both on its own and as part of a metric group.
- Suggestions: provide a list of suggestions tailored to the user.
- Trend: a circular icon with a green, orange, or red symbol that represents the cumulative progress of a subset of metrics.

Set up this way, the components of a Performance Summary offer a comprehensive view of business health and are integral in driving strategic decisions. However, generating such a multi-faceted report is complex. It requires not only the collation of data but also a subtle comprehension of how different metrics interact with and influence one another. Crafting a meaningful performance summary that conveys status, delves into specific performance metrics, and suggests practical recommendations calls for an advanced approach. This is where the capabilities of generative models become instrumental.

Generative Models for Performance Summaries: A Smart Choice

Summarizing large and diverse datasets is tough due to their complexity. Basic rule-based or machine-learning systems can’t grasp the context needed to connect important metrics for a meaningful summary.
Simpler models find it hard to handle complex ideas where key information needs to be condensed effectively. Capturing industry specifics, like jargon and domain knowledge, is a challenge for basic algorithms and requires many manual adjustments. Considering these challenges, using conventional methods for this feature is a tough sell.

The general availability of ChatGPT makes it possible to build features that require sophisticated contextual understanding. Because the model is trained on a massive amount of data, it is context-aware when analyzing inter-related metrics and KPIs. The model can abstract salient points from a large volume of data and add insight and recommendations based on good industry practices. ChatGPT excels at making reasonable predictions even when data is sparse or incomplete. This is especially handy for Databox users navigating niche or emerging markets where data can be unpredictable. Additionally, ChatGPT is easily fine-tuned for specific needs and can be adapted to focus on particular industries or subjects, allowing Databox to customize the Performance Summary feature based on user feedback and internal analysis. This not only boosts accuracy but also ensures the summaries are more relevant to the user’s context.

The Role of the Generative API in Databox

To utilize generative models in the Databox microservice ecosystem, we have introduced a “Generative API” microservice, which serves as a specialized intermediary between internal services and the OpenAI API. The Generative API supports all generative model use cases and exposes them to any service requiring that functionality. The diagram below demonstrates the outline of the solution.

- Generative API – The focal point of the architecture, which serves as an intermediary between the Databox ecosystem and the generative model.
Concretely, it abstracts the complexity of interfacing with the OpenAI API by providing a simplified, use-case-tailored interface for other services, under the ownership of the Data Science team.
- Service X and Service Y – Represent client microservices that require the results of the Generative API. The Generative API does not have access to user and metric data, meaning each service is responsible for fetching data and ensuring the data is up to date.
- AMQP, Load Balancer – The service is exposed via asynchronous messaging or HTTP. HTTP is appropriate when a smaller number of tokens is required and an immediate response is needed. Asynchronous messaging (via the AMQP protocol) is appropriate for lower-priority requests, requests where immediate results are not necessary, or when a longer calculation is expected.
- Database – Represents repositories of user data.

This architecture highlights a separation of concerns where the Generative API focuses on providing computational intelligence while client services handle data management. It also reflects a scalable and flexible approach to integrating AI capabilities into existing systems, allowing Databox to leverage the latest advancements in AI while keeping the core services optimized for their primary functions.

Weighing Up GPT-3.5-Turbo Against GPT-4

When deciding which model to use to power our feature, there were multiple factors to consider. Quality, performance and scalability, cost, and security all played an important role in deciding which model is most suitable for the Performance Summary feature.

Quality of ChatGPT models

For us, quality refers to the ability of ChatGPT to provide clear and informative summaries, which depends on the steerability and factuality of the model. We focused on three critical things: acknowledging our instructions (steerability), giving accurate information (factuality), and providing output that does not cause harm (safety guardrails).
Through extensive experimentation, we have defined the criteria for the output of the performance metric summary as follows:
- Limit referencing explicit metric values to a specified number to prevent information overload and maintain clarity.
- Convey information in a formal yet conversational manner.
- Do not explain or elaborate on the technical details of the data payload.
- Tag paragraphs with HTML markup.
- Tag metric names with HTML markup.
- Limit output to a certain character length.
- Output a valid JSON.

The criteria for the output of the suggestions are the following:
- Generate X suggestions (where X is a number).
- Each suggestion must be a complete sentence.
- Each suggestion must be limited to a certain character length.
- Generate in the form of a list.
- Output a valid JSON.

The criteria for the output of the trend value are the following:
- Output must be one of the enumerated values “positive”, “negative”, or “neutral”.

Satisfying any single criterion is not a problem for any sophisticated LLM. Satisfying all of them together while still maintaining factuality is a very difficult task. For the suggestions and trend, both GPT-3.5 and GPT-4 are compliant. This is likely because the instructions for suggestions and trends are much simpler, and the steerability demands are consequently lower. In the following table, we briefly describe the outcome of our experiments for the performance metric summary.

Performance and scalability of ChatGPT models

The difference in capability between GPT-3.5-Turbo and GPT-4 results from the difference in the underlying architecture. GPT-4, being larger, places higher demands on infrastructure, meaning it is more expensive for OpenAI to run, hence the longer response times, higher cost per token, and stricter rate limits. Let’s look at the response times first.
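As a rough way to reason about response times, generation time can be approximated as the number of completion tokens multiplied by a per-token latency. The sketch below uses the per-token latencies we measured (23.05 ms for GPT-3.5-Turbo and 55.36 ms for GPT-4). The function name, the 500-token example, and the linear model itself, which ignores network, queueing, and prompt-processing overhead, are illustrative assumptions.

```python
# Measured average generation latency per completion token, in milliseconds.
MS_PER_TOKEN = {
    "gpt-3.5-turbo": 23.05,
    "gpt-4": 55.36,
}


def estimated_response_seconds(model: str, completion_tokens: int) -> float:
    """Estimate generation time, assuming latency scales linearly with tokens.

    Network, queueing, and prompt-processing overhead are ignored
    (a simplifying assumption).
    """
    return MS_PER_TOKEN[model] * completion_tokens / 1000.0


if __name__ == "__main__":
    tokens = 500  # hypothetical completion length
    for model in MS_PER_TOKEN:
        seconds = estimated_response_seconds(model, tokens)
        print(f"{model}: ~{seconds:.1f} s for {tokens} tokens")
    ratio = MS_PER_TOKEN["gpt-4"] / MS_PER_TOKEN["gpt-3.5-turbo"]
    print(f"GPT-4 takes ~{ratio:.1f}x longer per token")
```

Under these assumptions, a 500-token completion takes on the order of ten seconds on GPT-3.5-Turbo and well over twenty on GPT-4, which is why completion length matters for interactive features.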
As expected, GPT-3.5-Turbo is faster at generating tokens, at 23.05 milliseconds per token versus 55.36 milliseconds per token for GPT-4, making the former roughly 2.4 times faster. It is easy to see how, even at moderate numbers of completion tokens, response times can be quite high. This is further complicated by the rate limits OpenAI imposes to reduce the load on its servers: sending too many requests per minute can result in an interrupted user experience. We improve the user experience by:
- Caching: We save and reuse responses of existing Performance Summary requests for given metrics. This reduces the request count and improves the user experience.
- Rate Limit Headers: OpenAI returns rate limit headers with each response, which we use to adjust how fast we send requests, avoiding any issues.
- Retry Strategies: If a request fails, we try again using an exponential backoff with jitter.
- Scheduling: We generate high-priority Performance Summaries on demand and space out low-priority requests to avoid overloading the system.

Cost of ChatGPT models

As discussed earlier, GPT-4 is ten times more expensive per token than GPT-3.5-Turbo. Regarding cost breakdown, the prompt has two components: fixed and variable. The variable part includes request-specific data like metric details, data source type, and aggregated values. This part also encompasses few-shot prompting examples and additional context. The fixed part outlines the common ruleset for all requests related to the performance summary use case. While GPT-3.5-Turbo is more economical, steering it can be challenging, requiring more tokens and advanced techniques for comparable results.

Consider a specific example in the context of the Performance Summary feature. The difference in input tokens and the number of requests made arises because GPT-3.5-Turbo is significantly harder to work with.
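To see how the number of requests and the per-token price interact, the following sketch compares one combined GPT-4 request with several separate GPT-3.5-Turbo requests. All token counts are hypothetical placeholders, not our measured values. Prices are expressed in relative units and use only the tenfold per-token ratio mentioned above, with no distinction between prompt and completion pricing. The `Request` type and `total_cost` helper are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable

# Relative price per token: GPT-4 is taken to be ten times GPT-3.5-Turbo,
# as stated in the text. Absolute prices are deliberately left out.
RELATIVE_PRICE = {"gpt-3.5-turbo": 1.0, "gpt-4": 10.0}


@dataclass(frozen=True)
class Request:
    fixed_prompt_tokens: int     # common ruleset for the use case
    variable_prompt_tokens: int  # request-specific metric data and examples
    completion_tokens: int       # tokens generated by the model


def total_cost(model: str, requests: Iterable[Request]) -> float:
    """Sum the relative cost of all requests sent to one model."""
    tokens = sum(
        r.fixed_prompt_tokens + r.variable_prompt_tokens + r.completion_tokens
        for r in requests
    )
    return tokens * RELATIVE_PRICE[model]


if __name__ == "__main__":
    # Hypothetical token counts, chosen only to illustrate the calculation.
    gpt4_requests = [Request(600, 900, 400)]  # one combined request
    gpt35_requests = [
        Request(500, 900, 300),  # summary
        Request(300, 900, 150),  # suggestions
        Request(200, 900, 10),   # trend
    ]
    print("GPT-4 relative cost:", total_cost("gpt-4", gpt4_requests))
    print("GPT-3.5-Turbo relative cost:", total_cost("gpt-3.5-turbo", gpt35_requests))
```

With these made-up inputs, the GPT-3.5-Turbo path sends the variable data three times and uses more tokens in total, yet the price ratio still dominates the result.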
While with GPT-4 we can define all Performance Summary elements and the intricacies involved in a single request, with GPT-3.5-Turbo we have to split the summary, suggestions, and trend into separate requests to improve the steerability of the model. Even considering this, GPT-3.5-Turbo is still significantly cheaper. For use cases that place lesser demands on the model’s reasoning capabilities, investing some time in improving the output of GPT-3.5-Turbo is often worth it.

Finally, to cap the monthly bill at a reasonable level, it is prudent to set limits on the product side. The key is to determine what counts as “reasonable” use and to choose limits such that users do not feel constrained during normal use, while avoiding unexpectedly high costs and load.

Reflecting on our choice

When it came to picking the right model for our Performance Summary feature, we had to find the sweet spot between the sheer power of GPT-4 and the cost-effectiveness and broader bandwidth of GPT-3.5-Turbo. When we dove into the Performance Summary use case, it was unclear which model to choose until we had tested both and reviewed our goals and limitations.

If your main goal is cost efficiency and speed, GPT-3.5-Turbo might be the savvy choice, especially if your use case is straightforward and your software design is clever enough to handle the limitations. On the other hand, GPT-4 excels at delivering top-notch output with a better grasp of context, making it the go-to for cases that call for detailed and nuanced results, even if it costs a bit more.

Ultimately, what sealed the deal for us was delivering value to our users. GPT-4’s ability to enhance the user experience with more accurate and insightful Performance Summaries is also something we considered at Databox. As this field evolves at a rapid pace, we will stay on top of updates and adjust our strategy accordingly.
Exploration of ChatGPT models is part of a series of technical articles that offer a look into the inner workings of our technology, architecture, and product & engineering processes. The authors of these articles are our product or engineering leaders, architects, and other senior members of our team, who share their thoughts, ideas, challenges, and other innovative approaches we’ve taken to constantly deliver more value to our customers through our products.

Aleksej Milosevic is a Product Scientist in the Data Science team, actively collaborating with the Product and Data Engineering teams to devise solutions that harness the power of data. His work focuses on building systems that integrate machine learning and extract actionable insights that significantly contribute to the advancement of our products and the enrichment of the overall user experience.

Stay tuned for a stream of technical insights and cutting-edge thoughts as we continue to enhance our products through the power of data and AI.