Accelerate your thought leadership by contributing to our blog. Join our community of experts now!
At Databox, our mission is to help growing businesses leverage their data to make better decisions and improve their performance. We envision a future where every company, no matter the size, can harness its existing data to create more accurate marketing plans, sales goals, budget planning, and more.
Last year, we started tracking exciting breakthroughs in data analysis, AI, and machine learning that can help us make advanced analysis available to every business. To help us apply these new technologies to our customers’ needs, we formed a dedicated Data Science team. Being on the lookout for the latest trends and innovative ways to integrate advanced features into the product, the team helps advance our goal of delivering cutting-edge features to our users and elevating their overall experience to the next level. So far, this has resulted in our newly released Databox Analytics platform, which also involves the Performance Summary feature, responsible for summarizing the performance of our client’s key metrics that saves them time and effort previously used for manually sifting through a large amount of data.
Since summarizing an extensive dataset is challenging due to its complexity, we have turned to generative models for support and faced a pivotal decision: GPT-3.5 or GPT-4. This blog post examines and compares GPT-3.5 and GPT-4 and contrasts their capabilities through the lens of our newly released Performance Summary feature.
By now, most people are familiar with ChatGPT. In simple terms, it is a language model that understands and generates human-like text based on the input received. For the model to be useful, it had to be trained on vast amounts of data. ChatGPT was trained on a diverse set of datasets, including Common Crawl, WebText2, Books1, Books2, and Wikipedia. The diverse training set is a big reason behind the general-purpose nature of the tool.
GPTs, or Generative Pretrained Transformers, operate with ‘tokens’, the fundamental units of text, using them to predict and create coherent responses. During the training phase, the model learns statistical relationships between these tokens, allowing it to generate contextually appropriate text. When a user sends a prompt, this input is first broken into tokens, a process called tokenization. For example, the sentence “Databox is amazing!” will be tokenized into [D,atab,ox, is, amazing,! ]. The tokens are processed through the model’s neural network, using learned patterns to create a fitting and coherent response as a token sequence.
OpenAI, the company behind ChatGPT, provides a simple chat interface that simplifies access to AI models, featuring GPT-3.5 and the latest addition, GPT-4. Both models have taken the world by storm by supporting functionalities such as:
Aside from the chat interface, OpenAI exposes an Application Programming Interface (API) for accessing AI models developed by OpenAI. The API is simple to use while maintaining the flexibility required for advanced use cases. It allows teams of all sizes to focus on research and development rather than distributed systems problems. Many models are exposed via the API, the most important being an optimized variant of GPT-3.5 called GPT-3.5-Turbo and the newest addition to the OpenAI portfolio of models, GPT-4.
When enhancing our features with ChatGPT, deciding which model to integrate was about tradeoffs. To better understand the characteristics and limitations of each model, our first step was comparing some of their differences, as found in the API documentation and the GPT-4 technical report.
The comparison above suggests that GPT-4 is supported by a significantly larger architecture. As a consequence, GPT-4 does better on the HumanEval benchmark dataset as well as provides comparatively greater results in exams across a broad spectrum of scientific endeavors such as Medicine, Computer Science, Science, Math, and Law. The significant qualitative difference is a direct consequence of the larger model size. However, processing time and costs are significantly increased, which is reflected in the higher pricing and more stringent API restrictions.
The security of the data our users entrust us with is our top priority. This is even more important when we send user data to a third party. As of the time of this writing, OpenAI has strong policies in place to ensure user data remains secure and private for both GPT-3.5-Turbo and GPT-4. Security is ensured by encrypting data at rest using AES-256 and data in transit using TLS 1.2+. OpenAI does not train models on user data, and they have been audited for SOC2 compliance. See the OpenAI Security Portal for more information on data security, privacy, and compliance on the OpenAI API platform.
OpenAI determines rate limits according to usage tiers. The higher the usage tier for an organization, the higher the rate limits. Our current limits for GPT-3.5-Turbo and GPT-4 are displayed in the table below.
One should default to using GPT-3.5-Turbo if at all possible and utilizing GPT-4 if all options are exhausted.
Most of our clients use Databox for aggregating and visualizing their data and benefit greatly from quick data interpretation. However, having all their data in one place can be overwhelming for clients with a large amount of it. The dilemma shifts from needing more data to being inundated with too much of it, which is why we have created the Performance Summary feature. It helps turn a wealth of data into actionable insights, saving our clients time and effort. Instead of sifting through tons of metrics, Performance Summary provides our users with a concise summary of their performance.
To give the user a snapshot of how well a specific aspect of their business is doing, the Performance Summary consists of several components. The Generative AI-supported elements are:
Set up this way, the components of a Performance Summary offer a comprehensive view of business health and are integral in driving strategic decisions. However, the generation of such a multi-faceted report is complex. It requires not only the collation of data but also a subtle comprehension of how different metrics interact with and influence one another. Crafting a meaningful performance summary that conveys status, delves into specific performance metrics, and suggests practical recommendations calls for an advanced approach. This is where the capabilities of generative models become instrumental.
Summarizing large and diverse datasets is tough due to their complexity. Basic rule-based or machine-learning systems can’t grasp the context needed to connect important metrics for a meaningful summary. Simpler models find it hard to handle complex ideas where key info needs to be condensed effectively. Capturing industry specifics, like jargon and knowledge, is a challenge for basic algorithms, requiring many manual adjustments. Considering these challenges, using conventional methods for this feature is a tough sell.
The general availability of ChatGPT makes building features that require sophisticated contextual understanding possible. Because the model is trained on a massive amount of data, it is context-aware when analyzing inter-related metrics and KPIs. The model can abstract salient points from a large volume of data and add insight and recommendations based on good practices from the industry.
ChatGPT excels at making reasonable predictions even when data is sparse or incomplete. This is especially handy for Databox users navigating niche or emerging markets where data can be unpredictable. Additionally, ChatGPT is easily fine-tuned for specific needs and can be adapted to focus on particular industries or subjects, allowing Databox to customize the Performance Summary feature based on user feedback and internal analysis. This not only boosts accuracy but also ensures the summaries are more relevant to the user’s context.
To utilize generative models in the Databox microservice ecosystem, we have introduced a “Generative API” microservice, which serves as a specialized intermediary between internal services and the OpenAI API. The Generative API supports all generative model use cases and exposes them to any services requiring functionality. The diagram below demonstrates the outline of the solution.
This architecture highlights a separation of concerns where the Generative API focuses on providing computational intelligence while client services handle data management. It also reflects a scalable and flexible approach to integrating AI capabilities into existing systems, allowing Databox to leverage the latest advancements in AI while keeping the core services optimized for their primary functions.
When deciding which model to use to power our feature, there were multiple factors to consider. Quality, performance and scalability, cost, and security all played an important role in deciding which model is most suitable for the Performance Summary feature.
For us, quality refers to the ability of ChatGPT to provide clear and informative summaries, which depends on the steerability and factuality of the model. We focused on three critical things: acknowledging our instructions (steerability), giving accurate information (factuality), and providing output that does not cause harm (safety guardrails).
Through extensive experimentation, we have defined the criteria for the output for the performance metric summary as follows:
The criteria for the output of the suggestions are the following:
The criteria for the output of the trend value are the following:
Accounting for any single criteria is not a problem for any sophisticated LLM. Accounting for all of them together while still maintaining factuality is a very difficult task. For the suggestions and trend, both GPT-3.5 and GPT-4 are compliant. This is likely because the instructions for suggestions and trends are much simpler, and the steerability demands are consequently lesser. In the following table, we briefly describe the outcome of our experiments for the performance metric summary.
The difference in capability between GPT-3.5-Turbo and GPT-4 results from the difference in the underlying architecture. GPT-4, being larger, has higher demands in terms of infrastructure, meaning it is more expensive to run for OpenAI, hence the longer response times, higher cost per token, and stricter rate limitations. Let’s look at the response times first.
As expected, GPT-3.5-Turbo is faster at generating tokens at 23.05 milliseconds per token versus 55.36 milliseconds per token for GPT-4, making the former roughly 2.4 times faster. It is easy to see how, at even moderate amounts of completion tokens, response times can be quite high. This is additionally complicated by the rate limits introduced by OpenAI to reduce the load on the servers – serving too many requests concurrently per minute can result in an interrupted user experience.
We improve the user experience by:
As discussed earlier, GPT-4 is ten times more expensive per token than GPT-3.5-Turbo. Regarding cost breakdown, the prompt has two components: fixed and variable.
While GPT-3.5-Turbo is more economical, steering it can be challenging, requiring more tokens and advanced techniques for comparable results. Consider a specific example in the context of the Performance Summary feature.
The difference in input tokens and the number of requests made is because GPT-3.5 Turbo is significantly harder to work with. While with GPT-4 we can define all Performance Summary elements and the intricacies involved in a single request, we have to split the summary, recommendation, and trend into separate requests to improve the steerability of the model. Even considering this, GPT-3.5 Turbo is still significantly cheaper. For use cases with lesser demands on the reasoning capabilities of the model, investing some time in improving the output of ChatGPT-3.5-Turbo is often worth it.
Finally, to cap the monthly bill at a reasonable rate, it is prudent to set limitations on the product side. The key here is to determine “reasonable” use and the appropriate limitations so the user does not feel constrained by normal use while avoiding unnecessary issues regarding unexpectedly high incurred costs and load.
When it came to picking the right model for our Performance Summary feature, we had to find that sweet spot between the sheer power of GPT-4 and the cost-effectiveness and broader bandwidth of GPT-3.5-Turbo. Diving into the Performance Summary use case, it was unclear which model to choose without testing and reviewing our goals and limitations. If your main goal is cost efficiency and speed, GPT-3.5-Turbo might be the savvy choice, especially if your use case is straightforward and your software design is clever enough to handle the limitations. On the other hand, GPT-4 excels at delivering top-notch output with a better grasp of things, making it the go-to for cases that crave detailed and nuanced results, even if it means a bit more on the budget. Ultimately, what sealed the deal for us was delivering value to our users. GPT-4’s ability to enhance the user experience with more accurate and insightful Performance Summaries is also something we considered at Databox. As this field evolves at a rapid pace, we will stay on top of updates and adjust our strategy accordingly.
Exploration of ChatGPT models is part of a series of technical articles, that offer a look into the inner workings of our technology, architecture, and product & engineering processes. The authors of these articles are our product or engineering leaders, architects, and other senior members of our team who are sharing their thoughts, ideas, challenges, or other innovative approaches we’ve taken to deliver more value to our customers through our products constantly.
Aleksej Milosevic is a Product Scientist in the Data Science team, actively collaborating with the Product and Data Engineering teams to devise solutions that harness the power of data. His work focuses on building systems that integrate machine learning, extracting actionable insights that significantly contribute to the advancement of our products and the enrichment of the overall user experience.
Stay tuned for a stream of technical insights and cutting-edge thoughts as we continue to enhance our products through the power of data and AI.
Get practical strategies that drive consistent growth
Latest from our blog
Popular Blog Posts
POPULAR DASHBOARD EXAMPLES & TEMPLATES