Insights from Berlin Buzzwords 2023: A Data Science Review
Introduction
Berlin Buzzwords is a prominent conference in Germany that focuses on the complex processes of data storage, processing, streaming, and exploration. With a strong emphasis on open-source projects, the event creates an engaging atmosphere for developers, data scientists, and tech enthusiasts to share ideas and insights.
This year, I attended five sessions covering a wide array of topics: building MLOps infrastructure for e-commerce, the advantages and disadvantages of semantic versus keyword search in the context of Generative Pre-trained Transformers (GPT), the nuances of hybrid search, the role of observability and analytics in fraud detection, and strategies for effective technical communication that avoid common mistakes.
Here are my main takeaways from each presentation…
The Talks
Talk 1: Developing MLOps Infrastructure for Japan’s Leading C2C E-Commerce Platform
Speakers: Ryan Ginstrom, Teo Narboneta Zosa
Recording: https://t.ly/2A2t
This session offered a deep dive into the process of constructing MLOps infrastructure to enhance Machine Learning (ML) capabilities at Mercari, Japan’s largest C2C e-commerce platform. The discussion included the implementation of a "learning to rank" (LTR) model and the progression from in-house systems to a modern MLOps framework, addressing the hurdles and achievements encountered along the way.
Mercari's approach to integrating Machine Learning into their existing search framework epitomizes the principle of “starting small, starting simple, and iterating.” Their initial strategy involved a seamless integration of ML into their search pipeline through the LTR method. Although not revolutionary, this approach was effective enough to validate ML's advantages to stakeholders.
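To make the "start small" idea concrete, here is a minimal sketch of what a learning-to-rank step layered on top of an existing keyword search can look like. The feature names, toy training data, and choice of scikit-learn are my own illustrative assumptions, not Mercari's implementation.

```python
# Minimal learning-to-rank sketch: re-order keyword-search hits with a trained model.
# Feature names and training data are illustrative, not Mercari's actual signals.
from sklearn.ensemble import GradientBoostingRegressor

def to_features(hit):
    # Real systems would pull many more signals (seller rating, historical CTR, ...)
    # from logs or a feature store; here we use whatever the hit already carries.
    return [hit["bm25_score"], hit["item_price"], hit["num_likes"]]

# Toy training data: feature vectors with graded relevance labels (e.g. from clicks).
X_train = [[12.3, 20.0, 5], [8.1, 35.0, 40], [3.2, 15.0, 2], [10.5, 50.0, 25]]
y_train = [1.0, 3.0, 0.0, 2.0]
ranker = GradientBoostingRegressor().fit(X_train, y_train)

def rerank(hits):
    """Re-rank hits returned by the keyword search engine, best first."""
    scores = ranker.predict([to_features(h) for h in hits])
    return [h for _, h in sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)]

hits = [
    {"id": "a", "bm25_score": 9.7, "item_price": 30.0, "num_likes": 12},
    {"id": "b", "bm25_score": 11.2, "item_price": 18.0, "num_likes": 3},
]
print([h["id"] for h in rerank(hits)])
```

The appeal of this shape is that the keyword search itself stays untouched; the ranker only re-orders its output, which is exactly what makes it an easy first step for demonstrating value to stakeholders.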
Below is a snapshot of Mercari’s original search pipeline, designed around conventional keyword search techniques, specifically Elasticsearch:
As their infrastructure matured, they encountered specific challenges. Initially, they had their ML model integrated with the search server, which slowed down experimentation. This led to transitioning the model into a microservice and incorporating a feature store. However, they faced additional difficulties with A/B testing, a feature that business users required for further experimentation.
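As a rough sketch of what that separation can look like from the search server's point of view, the snippet below calls a hypothetical ranking microservice and enriches each hit from a feature store first. The endpoint, payload shape, and feature-store lookup are invented for illustration.

```python
# Sketch of delegating scoring to a ranking microservice instead of scoring in-process.
# The endpoint URL, payload shape, and feature store below are hypothetical.
import requests

FEATURE_STORE = {  # stand-in for a real feature store, keyed by item id
    "a": {"item_price": 30.0, "num_likes": 12},
    "b": {"item_price": 18.0, "num_likes": 3},
}

def rerank_via_service(query, hits, endpoint="http://ranker.internal/v1/score"):
    # Enrich each hit with stored features, then let the model service score the batch.
    payload = {
        "query": query,
        "items": [{**h, **FEATURE_STORE.get(h["id"], {})} for h in hits],
    }
    response = requests.post(endpoint, json=payload, timeout=0.2)
    response.raise_for_status()
    scores = response.json()["scores"]  # one score per item, same order as sent
    return [h for _, h in sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)]
```

Keeping the model behind its own service is what lets the ML team redeploy and experiment without touching the search server, at the cost of an extra network hop on every query.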
Alongside their evolving ML serving strategy, Mercari tackled intricate data engineering issues. Their initial data pipeline, developed using dbt, became overly complex due to repetitive SQL queries, highlighting the necessity for implementing best practices in data engineering. Promptly adopting these practices resulted in a more efficient pipeline, underscoring the importance of solid data engineering principles right from the project's inception.
Interestingly, Mercari adopted a mature MLOps platform, Seldon, only late in their journey, despite the intricate challenges they faced along the way. Seldon provides comprehensive tooling for serving and monitoring ML models, yet it was considered only once their homegrown system became unwieldy. In hindsight, I believe an earlier evaluation of such a platform could have streamlined their process and mitigated some of these challenges.
Below, you can see the final iteration of Mercari’s search pipeline with the MLOps platform Seldon integrated:
Talk 2: The Context of Semantic vs. Keyword Search for GPT
Speaker: Tudor Golubenco
Recording: https://t.ly/qPpOG
In this enlightening session, Tudor Golubenco, the CTO of Xata, questioned the traditional methods of using Large Language Models (LLMs) for Question-Answering (QnA) on documentation data. While the usual practice combines an LLM with vector search, Xata's innovative approach integrates an LLM with a keyword search mechanism for enhanced tunability.
Xata aims to empower users with QnA capabilities on their documentation: a keyword search, likely a BM25 variant, retrieves the pertinent documentation pages, and OpenAI's ChatGPT then answers the user's query using those pages as context.
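A simplified sketch of this kind of retrieval-augmented QnA flow is shown below. The in-memory corpus, prompt wording, and model name are my own assumptions for illustration, not Xata's implementation, and the toy keyword_search function merely stands in for a real BM25-style engine.

```python
# Retrieval-augmented QnA with keyword search instead of vector search (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DOC_PAGES = [
    "Page about creating and configuring tables ...",
    "Page about working with branches and schema changes ...",
]

def keyword_search(query: str, top_k: int = 3) -> list[str]:
    # Toy stand-in for a BM25-style search: rank pages by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(DOC_PAGES, key=lambda p: len(words & set(p.lower().split())), reverse=True)
    return ranked[:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(keyword_search(question))
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the documentation excerpts provided."},
            {"role": "user",
             "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

Compared with the usual vector-search setup, the only moving part that changes is the retrieval function; the prompting stays the same.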
An illustration of their distinctive method is shown below:
Interestingly, it was the tunability of keyword search that directed Xata towards this innovative path. Keyword search offers a variety of parameters to adjust, such as boosters and column weights, enhancing relevance. The following comparison illustrates the tunability of both methods:
I find this approach promising, assuming there is a skilled team available to fine-tune keyword searches. However, maintaining a large dictionary of synonyms, often necessary in BM25 searches, can be tedious. The challenge is compounded when considering internationalization; without multilingual experts on the team, addressing a broad range of languages can be daunting. Overall, it’s exciting to observe the innovative combination of LLM and keyword search to enhance QnA on documentation data.
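For readers unfamiliar with what that tuning actually looks like, below is a small Elasticsearch-style example of the typical knobs: a hand-maintained synonym filter and per-field boosts. The index settings, field names, and boost values are made up for illustration and are not Xata's configuration.

```python
# Illustrative Elasticsearch-style settings and query showing common keyword-search knobs.
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "doc_synonyms": {
                    "type": "synonym",
                    # This list is exactly the kind of dictionary that grows and must be maintained.
                    "synonyms": ["db, database", "config, configuration"],
                }
            },
            "analyzer": {
                "docs_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "doc_synonyms"],
                }
            },
        }
    }
}

query = {
    "query": {
        "multi_match": {
            "query": "configure the database",
            "fields": ["title^3", "headings^2", "body"],  # boost matches in titles and headings
        }
    }
}

print(json.dumps(query, indent=2))
```

Every boost value and synonym entry is a lever someone has to pick and revisit, which is both the strength (tunability) and the cost (maintenance) discussed above.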
Talk 3: Embracing Hybrid Search
Speakers: Roman Grebennikov, Vsevolod Goloviznin
Recording: https://t.ly/cczJV
The developers of the Metarank project, Roman Grebennikov and Vsevolod Goloviznin, presented an intriguing perspective on hybrid search. Traditionally, I associated hybrid search with the integration of BM25 search and vector search; however, their presentation suggested a broader interpretation. They showcased that hybrid search can involve an ensemble of models, demonstrating that various search methodologies can be combined to create a powerful ranking system that leverages the strengths of each approach.
At the core of their strategy is the Metarank project, a standalone component designed to re-rank search results generated by your search engine:
Their experiments utilized the enriched Amazon ESCI dataset, which included a diverse array of product-related metadata (accessible at github.com/shuttie/esci-s), providing a rich and realistic testing environment for e-commerce. They employed Metarank to assess various search methodologies against this dataset.
The essence of their methodology was the integration of different techniques: the hybrid model they introduced merges traditional text matching, Learning-to-Rank, and modern neural search methods, achieving performance that exceeds that of any single method.
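As a rough illustration of the ensemble idea (not Metarank's actual scoring code), combining normalized scores from several retrieval methods can be as simple as a weighted sum; the weights and score sources below are invented.

```python
# Toy score fusion: blend per-document scores from BM25, a neural retriever, and an LTR model.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(bm25: dict, neural: dict, ltr: dict, weights=(0.4, 0.4, 0.2)) -> list[str]:
    bm25, neural, ltr = normalize(bm25), normalize(neural), normalize(ltr)
    docs = set(bm25) | set(neural) | set(ltr)
    combined = {
        d: weights[0] * bm25.get(d, 0.0)
           + weights[1] * neural.get(d, 0.0)
           + weights[2] * ltr.get(d, 0.0)
        for d in docs
    }
    return sorted(combined, key=combined.get, reverse=True)

print(fuse({"d1": 9.1, "d2": 4.2}, {"d1": 0.62, "d3": 0.80}, {"d2": 1.3, "d3": 0.4}))
```

In practice the combination is usually learned rather than hand-weighted, which is precisely where a re-ranker such as Metarank comes in.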
While the cross-encoder introduces significant latency, their hybrid approach demonstrated promising outcomes, especially when the document list to be re-ranked was kept concise.
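That latency is typically bounded by handing the cross-encoder only a short candidate list from the cheaper first stage. A minimal sketch with the sentence-transformers library follows; the model choice and the cut-off of 20 candidates are my assumptions, not figures from the talk.

```python
# Re-rank only the top candidates with a cross-encoder to keep latency bounded.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 20) -> list[str]:
    shortlist = candidates[:top_n]  # assume the first stage already ordered the candidates
    scores = model.predict([(query, doc) for doc in shortlist])
    reranked = [doc for _, doc in
                sorted(zip(scores, shortlist), key=lambda p: p[0], reverse=True)]
    return reranked + candidates[top_n:]  # leave the tail in its original order
```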
Their work is available for the community (accessible at: https://github.com/metarank/esci-playground), and they also provided a live demo (viewable at: https://demo.metarank.ai/). This presentation offered fresh and valuable insights into hybrid search methodologies.
Talk 4: Detecting Fraud with Observability and Analytics
Speaker: Philipp Krenn
Recording: https://t.ly/o4_Z
Philipp Krenn's engaging narrative revealed the unexpected hurdles Elastic faced when launching their annual competition to honor community contributions like pull requests, blog posts, and talks. The introduction of prizes such as MacBooks led to a surge of fraudulent entries, necessitating innovative detection methods to handle the influx of deceptive activities.
What made this discussion captivating was Elastic's use of its own tools, especially its observability features and Kibana, to identify fraudulent behavior. Through meticulous analysis, curiosity, and diverse data visualization techniques, they distinguished authentic contributions from attempts to exploit the system.
This session highlighted the advantages of merging search, observability, and analytics data to uncover correlations and achieve a comprehensive understanding of complex situations. The effective application of these capabilities demonstrated how well-utilized Elastic’s observability tools and Kibana can address unexpected challenges.
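As a hedged illustration of the kind of analysis this enables (the index and field names below are hypothetical, not the actual queries from the talk), a simple Elasticsearch aggregation can surface suspicious clusters, such as many contest entries arriving from one IP address under different accounts.

```python
# Hypothetical aggregation for spotting suspicious submission clusters.
# Index and field names are invented; the body can also be pasted into
# Kibana Dev Tools as: GET submissions/_search { ... }
import json

suspicious_ips_query = {
    "size": 0,
    "aggs": {
        "by_ip": {
            "terms": {"field": "client_ip", "size": 20},  # busiest source IPs
            "aggs": {
                "distinct_accounts": {"cardinality": {"field": "account_id"}},
            },
        }
    },
}

print(json.dumps(suspicious_ips_query, indent=2))
```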
Conclusion
Berlin Buzzwords 2023 was a highly informative conference that explored many facets of search technology. If you were unable to attend, I highly recommend watching the session recordings on YouTube; they offer a wealth of insights and innovative strategies in the field of search.