From Monolith to Data Mesh: DataHen's Evolution to a Scalable Web Scraping Infrastructure

DataHen has come a long way since its inception as a monolithic Ruby on Rails app capable of scraping only a few hundred websites. In its most recent iteration, we are proud to have built an almost infinitely elastic web scraping platform that can handle thousands of scrapers simultaneously. In this post, we dive into how DataHen's data mesh infrastructure has evolved, focusing on the role of PostgreSQL clustering and Kubernetes in achieving this scalability.

The Journey: From a Monolithic App to a Scalable Web Scraping Platform

When DataHen first started, our primary goal was to create an efficient and reliable web scraping platform that would meet the needs of businesses looking to extract valuable data from websites. Our initial infrastructure was a simple Ruby on Rails application that was able to handle a limited number of web scraping tasks. As the demand for web scraping grew, so did the need for a more scalable and robust platform.

Challenges in Data Storage

At the beginning of our journey, we used a single PostgreSQL database as our primary data store. This worked well for a small number of scrapers, but as that number grew we quickly ran into the "noisy neighbour" problem: heavy database usage by one scraper would degrade performance for all the others. This issue was further compounded by the inherent limits of a single database instance in storage capacity and processing power.

To overcome these challenges, we needed to adopt a new approach to data storage and management that would ensure the efficient and reliable operation of our growing number of scrapers. This led us to explore the possibilities of containerization and orchestration technologies, such as Kubernetes.

Kubernetes, Data Persistence, and PostgreSQL

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It is an excellent choice for managing complex, distributed systems like DataHen's web scraping platform. However, Kubernetes has some limitations when it comes to data persistence.

In a containerized environment, data is typically stored in ephemeral storage, which means it is lost when the container is terminated or restarted. This presents a significant challenge for web scraping tasks that require long-term data storage and management. To address this challenge, we needed to find a way to combine the benefits of Kubernetes with the robust features and performance of PostgreSQL.

The Birth of DataHen's Data Mesh Infrastructure

To overcome the limitations of Kubernetes in terms of data persistence and to continue using PostgreSQL as our primary datastore, we decided to develop our own data mesh infrastructure. This innovative approach to data storage and management allowed us to create a highly scalable and elastic web scraping platform capable of handling thousands of scrapers simultaneously.

Our data mesh infrastructure comprises several key components:

  1. Dynamic Postgres Instance Creation: To give each scraper job dedicated resources and to enable near-infinite scalability, our platform starts a new PostgreSQL instance dynamically whenever a scraper job begins, alongside the Kubernetes pods that run the job's fetcher and parsers (see the first sketch after this list).
  2. Data Persistence and Management: In our data mesh infrastructure, data persistence and management are crucial to the reliable operation of our scrapers. When a scraper job is paused, idle, or completed, the system dumps its database to a file, which is uploaded to our file storage system along with the Write-Ahead Log (WAL) files. This ensures the data is safely stored and the job can be resumed later (see the second sketch below).
  3. Database Restores: When a user resumes a scraper job, the system starts all the necessary pods, downloads the Postgres dump, and loads it into the job's fresh database. This allows the scraper to continue from where it left off, ensuring seamless operation and minimal data loss (see the third sketch below).
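
To make the first component concrete, here is a minimal sketch of how a dedicated Postgres pod might be launched for a job. The namespace, labels, image tag, and resource limits are illustrative assumptions rather than our production values; the sketch simply builds a pod manifest as a Ruby hash and pipes it to kubectl.

```ruby
require "yaml"
require "open3"

# Build a minimal pod manifest for a job's dedicated Postgres instance.
# Names, namespace, labels, and resource limits are illustrative only.
def postgres_pod_manifest(job_id)
  {
    "apiVersion" => "v1",
    "kind"       => "Pod",
    "metadata"   => {
      "name"      => "pg-job-#{job_id}",
      "namespace" => "datahen-jobs", # hypothetical namespace
      "labels"    => { "app" => "job-postgres", "job" => job_id.to_s }
    },
    "spec" => {
      "containers" => [{
        "name"  => "postgres",
        "image" => "postgres:15",
        "env"   => [{ "name" => "POSTGRES_PASSWORD",
                      "valueFrom" => { "secretKeyRef" => {
                        "name" => "pg-credentials", "key" => "password" } } }],
        "ports" => [{ "containerPort" => 5432 }],
        "resources" => { "limits" => { "cpu" => "2", "memory" => "4Gi" } }
      }]
    }
  }
end

# Apply the manifest by piping it to kubectl on stdin.
def start_postgres_for(job_id)
  manifest = postgres_pod_manifest(job_id).to_yaml
  _out, err, status = Open3.capture3("kubectl apply -f -", stdin_data: manifest)
  raise "failed to start Postgres for job #{job_id}: #{err}" unless status.success?
end
```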
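
The pause path from the second component could look something like the sketch below. pg_dump and its flags are standard PostgreSQL tooling, but the upload_to_storage helper and the storage key layout are placeholders for whatever file storage client is actually in use.

```ruby
require "open3"

# Dump a job's database to a compressed custom-format archive.
# pg_dump's custom format produces a file pg_restore can load later.
def dump_job_database(job_id, conn_url)
  dump_path = "/tmp/job-#{job_id}.dump"
  _out, err, status = Open3.capture3(
    "pg_dump", "--format=custom", "--file=#{dump_path}", conn_url
  )
  raise "pg_dump failed for job #{job_id}: #{err}" unless status.success?
  dump_path
end

# Hypothetical storage client -- stands in for the real object store.
def upload_to_storage(local_path, remote_key)
  # e.g. S3, GCS, or an internal file store; omitted here.
end

def pause_job(job_id, conn_url)
  dump = dump_job_database(job_id, conn_url)
  upload_to_storage(dump, "jobs/#{job_id}/db.dump")
  # WAL segments would be archived alongside the dump in the same way.
end
```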
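
And the resume path from the third component, mirroring the previous sketch: download the archived dump and load it into the job's freshly started Postgres instance with pg_restore. The download helper is again a placeholder, and waiting for the new pod to become ready is omitted for brevity.

```ruby
require "open3"

# Hypothetical download helper, the mirror of upload_to_storage above.
def download_from_storage(remote_key, local_path)
  # fetch the archive from the object store; omitted here.
end

# Restore the archived dump into the job's new Postgres instance.
def resume_job(job_id, conn_url)
  start_postgres_for(job_id) # from the first sketch; readiness wait omitted
  dump_path = "/tmp/job-#{job_id}.dump"
  download_from_storage("jobs/#{job_id}/db.dump", dump_path)

  _out, err, status = Open3.capture3(
    "pg_restore", "--dbname=#{conn_url}", "--clean", "--if-exists", dump_path
  )
  raise "pg_restore failed for job #{job_id}: #{err}" unless status.success?
end
```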

Interconnecting Scrapers and Smart Routing

One of the key features of our data mesh is the ability for scraper jobs to interact with one another. For example, a scraper can retrieve data from one job to initiate another job. This functionality enables us to create a chain of scrapers that work together in a single, cohesive flow, providing greater efficiency and flexibility in our web scraping tasks.
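
As a rough illustration of such chaining, the sketch below reads the output records of one job and uses them to enqueue pages on another. The router address, endpoint paths, and payload fields are hypothetical stand-ins for the mesh's actual internal API.

```ruby
require "net/http"
require "json"
require "uri"

MESH_ROUTER = "http://mesh-router.internal" # hypothetical router address

def mesh_get(path)
  JSON.parse(Net::HTTP.get(URI("#{MESH_ROUTER}#{path}")))
end

def mesh_post(path, payload)
  Net::HTTP.post(URI("#{MESH_ROUTER}#{path}"),
                 payload.to_json, "Content-Type" => "application/json")
end

# Read the output records of a finished listing job and seed a detail
# job with one page to fetch per discovered URL.
def chain_jobs(source_job_id, target_job_id)
  outputs = mesh_get("/jobs/#{source_job_id}/outputs") # hypothetical endpoint
  outputs.each do |record|
    mesh_post("/jobs/#{target_job_id}/pages",          # hypothetical endpoint
              "url" => record["product_url"], "method" => "GET")
  end
end
```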

To achieve this interconnectedness, we implemented a smart routing system that routes API requests between scrapers through a dedicated router. This router is responsible for efficiently and securely directing the traffic to the intended node within the data mesh. This smart routing system not only ensures seamless communication between scraper jobs but also helps maintain the overall performance and stability of the platform.
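
A stripped-down version of that routing logic might look like the following Rack app, which extracts a job ID from the request path, looks up the owning node in a registry, and forwards the request. The registry structure and in-cluster address scheme are assumptions for illustration; a production router would also need authentication, retries, and support for more than GET traffic.

```ruby
require "net/http"
require "uri"

# Minimal routing sketch: map the job ID in the request path to the
# in-mesh address of that job's pods, then forward the request.
class MeshRouter
  def initialize(registry)
    @registry = registry # job_id => "host:port" of the job's service
  end

  # Rack interface: route /jobs/:id/... to the owning node.
  def call(env)
    job_id = env["PATH_INFO"][%r{\A/jobs/(\d+)}, 1]
    target = @registry[job_id]
    return [404, { "content-type" => "text/plain" }, ["unknown job"]] unless target

    uri  = URI("http://#{target}#{env['PATH_INFO']}")
    resp = Net::HTTP.get_response(uri) # GET-only for brevity
    [resp.code.to_i, { "content-type" => resp["content-type"].to_s }, [resp.body]]
  end
end

# Usage (e.g. in config.ru), with a hypothetical in-cluster address:
#   run MeshRouter.new("42" => "pg-job-42.datahen-jobs.svc.cluster.local:8080")
```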

Monitoring and Scaling the Data Mesh Infrastructure

As our data mesh infrastructure continued to grow, it became essential to monitor its performance and resource utilization. To achieve this, we integrated various monitoring tools and techniques into our platform, including custom metrics and alerts, to ensure that we could proactively address potential issues before they impacted our scrapers.
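
As one example of the kind of check involved, the sketch below polls a job's Postgres instance for its connection count and database size and raises an alert past a threshold. It assumes the pg gem is available; the threshold and the alert hook are placeholders for real alerting integrations.

```ruby
require "pg" # PostgreSQL driver gem, assumed available

CONNECTION_LIMIT = 80 # illustrative alert threshold

# Hypothetical alert hook -- wire this to PagerDuty, Slack, etc.
def alert(message)
  warn "[ALERT] #{message}"
end

# Poll one per-job Postgres instance for basic health metrics.
def check_instance(job_id, conn_url)
  conn = PG.connect(conn_url)
  connections = conn.exec("SELECT count(*) FROM pg_stat_activity")
                    .getvalue(0, 0).to_i
  db_size = conn.exec(
    "SELECT pg_size_pretty(pg_database_size(current_database()))"
  ).getvalue(0, 0)

  alert("job #{job_id}: #{connections} connections") if connections > CONNECTION_LIMIT
  { job_id: job_id, connections: connections, db_size: db_size }
ensure
  conn&.close
end
```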

One crucial aspect of maintaining the performance and reliability of our data mesh infrastructure is the ability to scale it according to the needs of our scrapers. Our platform is designed to be highly elastic, allowing us to add or remove resources as needed to ensure optimal performance.

The combination of dynamic PostgreSQL instance creation, Kubernetes-based orchestration, and smart routing enables us to scale our platform almost infinitely, limited only by the available resources and our willingness to invest in them. This elasticity has been instrumental in allowing DataHen to scrape and crawl some of the world's largest websites and meet the diverse web scraping needs of our clients.

The Impact of DataHen's Data Mesh Infrastructure

The adoption of our data mesh infrastructure, built on PostgreSQL clustering and Kubernetes, has had a profound impact on the scalability and performance of our web scraping platform. By eliminating the "noisy neighbour" problem and enabling efficient interconnection between scraper jobs, we have unlocked new possibilities for web scraping and data extraction.

Our innovative data mesh and PostgreSQL clustering solution has empowered businesses to tackle even the most challenging web scraping projects, providing them with valuable insights and data to drive their decision-making processes.

Conclusions

DataHen's data mesh infrastructure, built on PostgreSQL clustering and Kubernetes, has allowed us to create a highly scalable and elastic web scraping platform capable of handling the world's largest websites. Dedicated per-job Postgres instances eliminated the "noisy neighbour" problem, while smart routing lets scraper jobs work together as a single, cohesive flow.

If you have a complex web scraping requirement or would like to learn more about how DataHen can empower your data extraction efforts, don't hesitate to contact us. Our team of experts is ready to help you unlock the full potential of web data for your business.

Ready to take on your web scraping challenges with DataHen? Reach out to us today, and let's discuss how we can help you scale your data extraction efforts to new heights!