
Simplifying Data Ingestion into the Databricks Lakehouse Platform

As organizations navigate a fast-paced environment, business analytics has become a significant factor in their success. The analytics field has evolved from simply displaying facts and figures into a more collaborative form of business intelligence that uses machine learning to predict outcomes and assist decision-making.
Databricks has emerged as a leading platform, becoming the new hub for data for many organizations. This unified data analytics platform has been instrumental in simplifying the process of building big data pipelines, providing a collaborative workspace for data scientists, and enabling powerful machine learning capabilities.
Ingesting Data into Databricks
To unlock the full potential of Databricks and achieve maximum business value, organizations need to ingest vast amounts of data into the platform. This data can come from a variety of sources, including relational databases, cloud storage, data warehouses, and real-time streaming sources.
Organizations are increasingly leveraging a variety of databases to handle their diverse data needs. From transactional databases like MySQL and PostgreSQL to NoSQL databases like MongoDB and Cassandra, and cloud-based databases like Amazon DynamoDB and Google Cloud Firestore, the data landscape within an organization has become a complex web of disparate systems. While this proliferation of databases provides flexibility and specificity for different use cases, it presents a significant challenge: each database operates in its own silo, leading to fragmented data views, difficulty in cross-database analysis, and increased complexity in data management.
By ingesting data from various sources, organizations can create a 'single source of truth' for their data, enabling comprehensive analytics and data science capabilities across all their data. Databricks' integration with Delta Lake also ensures reliability and performance at scale, providing ACID transactions and a unified process for batch and streaming data. This unification of data not only simplifies data management but also empowers organizations to derive more valuable insights, make data-driven decisions, and ultimately, drive business growth.
Within the Databricks product portfolio, Delta Lake has gained prominence as a central component that curates, refines, and aggregates the organization's data to enable near real-time business intelligence.
Delta Lake is an open-format storage layer that unifies all types of data for transactional, analytical, and AI use cases. Delta Lake is built on the medallion architecture data design pattern to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture, from raw ingestion (bronze) to filtering and cleaning (silver) to business-level aggregations (gold). This simple data model is beneficial for driving business intelligence and machine learning applications.
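To make the medallion flow concrete, here is a minimal PySpark sketch of bronze, silver, and gold layers. The table names, landing path, and columns are hypothetical, and `spark` is the ambient SparkSession in a Databricks notebook; treat this as an illustration of the pattern, not a prescribed pipeline.

```python
from pyspark.sql import functions as F

# Bronze: land the raw data as-is, preserving the original payload.
raw = spark.read.json("/mnt/landing/events")          # hypothetical landing path
raw.write.format("delta").mode("append").saveAsTable("bronze_events")

# Silver: filter and clean, e.g. drop malformed rows and deduplicate by key.
bronze = spark.read.table("bronze_events")
silver = (bronze
          .filter(F.col("event_id").isNotNull())
          .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

# Gold: business-level aggregation ready for BI consumption.
gold = (silver
        .groupBy(F.to_date("event_time").alias("event_date"))
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_event_counts")
```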

Challenges in Data Ingestion
Despite the numerous benefits, ingesting data into Databricks is not without its challenges. These can range from technical issues to organizational hurdles.
One of the main technical challenges is dealing with data in different formats from various sources such as databases, cloud storage, and streaming data directly. Each data source may require a different method of extraction and may have its own unique schema. This can make the process of data ingestion complex and time-consuming.
Databricks suggests multiple options to help customers bring their data into the Lakehouse Platform. We compiled a list based on the Databricks documentation; please note that this is by no means an exhaustive list:
- Auto Loader: Incrementally and efficiently processes new data files as they arrive in cloud storage (such as S3) or DBFS, without any additional setup.
- COPY INTO: A command to bulk-copy data from a staging location (such as S3) into a Delta Lake table (see the sketch after this list).
- Direct connections to sources
- Connections to other streaming sources
- Third-party partners: Partners like Fivetran ingest data into Databricks using a SQL endpoint.
- Add Data UI (Still in public preview)
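As a point of reference, a COPY INTO call issued from a notebook might look roughly like the following. The staging path and table name are placeholders, the target table is assumed to already exist, and `spark` is the ambient SparkSession; this is a sketch, not a full recipe.

```python
# Bulk-load staged JSON files into an existing Delta table with COPY INTO.
# The bucket path and table name below are placeholders.
spark.sql("""
    COPY INTO bronze_orders
    FROM 's3://my-staging-bucket/orders/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```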
Let’s focus on Auto Loader, as that is the option Databricks recommends.
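A minimal Auto Loader stream in PySpark looks roughly like this; the landing path, checkpoint locations, and target table name are placeholder assumptions.

```python
# Incrementally ingest new files from cloud storage with Auto Loader (cloudFiles).
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schema")
          .load("s3://my-landing-bucket/orders/"))

(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/orders")
       .trigger(availableNow=True)   # process all files discovered so far, then stop
       .toTable("bronze_orders"))
```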
Auto Loader Architecture
A simplified picture of the Auto Loader architecture looks like this:

This might work like a charm for sources like log files. However, if the source is a database or multiple databases, the architecture looks like the following.

The scenic route includes using a CDC tool such as Debezium or Fivetran, connecting to a message queue like Kafka (when using Debezium), and loading the CDC data into cloud storage. The data is then picked up by Auto Loader (cloudFiles) and loaded into Delta Lake.
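To give a flavor of the last leg of this route, the sketch below reads Debezium-style change events that have landed as JSON files in cloud storage and merges them into a Delta table. The field names follow Debezium's change-event envelope ("op", "before", "after"), while the paths, table name, and columns are hypothetical; per-batch deduplication and error handling are omitted.

```python
from delta.tables import DeltaTable

# Pick up CDC files landed in cloud storage (placeholder paths).
cdc_stream = (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/mnt/checkpoints/customers_cdc/_schema")
              .load("s3://cdc-landing/customers/"))

def apply_changes(batch_df, batch_id):
    # Flatten the Debezium envelope; "after" is null for deletes, so fall back to "before".
    changes = batch_df.selectExpr(
        "coalesce(after.id, before.id) AS id",
        "after.name AS name",
        "op")
    # Assumes the target Delta table already exists.
    target = DeltaTable.forName(spark, "bronze_customers")
    (target.alias("t")
           .merge(changes.alias("s"), "t.id = s.id")
           .whenMatchedDelete(condition="s.op = 'd'")
           .whenMatchedUpdate(set={"name": "s.name"})
           .whenNotMatchedInsert(condition="s.op != 'd'",
                                 values={"id": "s.id", "name": "s.name"})
           .execute())

(cdc_stream.writeStream
           .foreachBatch(apply_changes)
           .option("checkpointLocation", "/mnt/checkpoints/customers_cdc")
           .start())
```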
This shows that one would have to set up different architectures for ingesting data into Databricks depending on the type of source data. The usage of connectors comes with its own set of problems.
Issues with Debezium Connectors
Debezium is an open-source distributed platform for change data capture (CDC). It can monitor and record all row-level changes in your databases, and it supports a variety of databases.
- Debezium has known issues around configuration complexity, resource utilization, snapshotting, handling larger tables, sustaining higher throughput, and more (see the configuration sketch after this list).
- Debezium is designed to run as a set of Apache Kafka Connect-compatible connectors, so you will need to set up and operate a Kafka cluster.
- Additionally, one would have to set up a different instance of Debezium for each source database, adding to the operational complexity.
- Debezium guarantees only “at-least-once” message delivery, which can lead to data duplication.
- If connectors are not paired with local state, they can cause backlogged data pipelines, resulting in data loss.
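To illustrate the configuration surface, registering a single Debezium MySQL connector with the Kafka Connect REST API might look roughly like the sketch below. The hostnames, credentials, topics, and table names are placeholders, and the property names follow recent Debezium releases; a real deployment also needs the Kafka and Kafka Connect clusters themselves.

```python
import json
import requests

# Hypothetical connector registration; every value below is a placeholder.
connector = {
    "name": "orders-mysql-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "topic.prefix": "orders-db",
        "table.include.list": "shop.orders,shop.customers",
        # Debezium also needs its own schema-history topic (one more moving part).
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.orders-db",
    },
}

# Register the connector with a Kafka Connect worker (placeholder URL).
resp = requests.post("http://connect:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
```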
Issues with Cloud-based Connectors
Cloud-based connectors like Fivetran or Hevo Data come with a different set of issues.
- There is no dedicated state to store data in transit, so in case of a crash or unavailability of a source or sink, pipelines can become backlogged or, worse, data can be lost.
- Most of these connectors don't enable real-time use cases. They either require an agent to be installed on the source and sink or poll at fixed intervals to capture changed data.
- Cloud-only offerings result in higher network costs. (This is in addition to higher licensing costs due to activity-based pricing.)
More importantly, end-to-end monitoring and observability are lacking in both approaches. Each component manages its own monitoring until the data is handed over to the next one. These problems make ingesting data into Databricks a complicated affair.
How does Grainite help with data ingestion into Databricks?
Grainite is a converged streaming application platform that unifies a message queue, an event processing engine, and a streaming database. The result is the easiest and fastest way to move data, perform complex and stateful transformations, store the data, and serve materialized views directly to downstream applications.

The unification of multiple capabilities into Grainite opens up a variety of possibilities.
- With Grainite, one can ingest data from multiple sources such as databases, applications, and other streaming systems.
- Perform in-line stateful transformations, filtering, and joins.
- Statefully store the data and even query it on the fly.
- Provide exactly-once processing with strongly consistent consumer and producer cursors.
Grainite removes the need to deploy multiple products like Debezium and Kafka or set up an expensive cloud-based connector tool.
The advantages of going with Grainite are manifold:
- One-stop shop to move enterprise data from sources directly into your Delta Lake without the need for intermediary storage or external connectors.
- Enables real-time data synchronization from various sources to Delta Lake.
- Zero data loss with guaranteed message delivery - enabling the highest platform reliability.
- Automatically resolves conflicts when consolidating data from multiple sources.
- Out-of-box and end-to-end data pipeline monitoring and observability.
- Deploy on-premises or in any public-cloud environment close to your source of data to avoid expensive networking costs.
Additional Benefit - Eliminate Auto Loader and write directly to Delta Lake
Instead of maintaining different architectures for different sources, Grainite acts as an intelligent middleware to ingest data from any source into Databricks. However, that is not where the benefits end.
If you recall, the goals of Auto Loader (illustrated in the sketch after this list) are to:
- Track and stream only incremental changes into Delta Lake.
- Handle schema changes
- Optionally provide alerts
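For reference, these goals map onto Auto Loader options roughly as follows; the paths, table name, and option values shown are placeholder assumptions.

```python
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          # Schema handling: infer and persist the schema, and evolve it as new columns appear.
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schema")
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
          # File discovery via cloud notification services instead of directory listing.
          .option("cloudFiles.useNotifications", "true")
          .load("s3://my-landing-bucket/orders/"))

# Incremental tracking: processed files are recorded in the checkpoint, so each
# file is ingested into the Bronze table only once.
(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/orders")
       .toTable("bronze_orders"))
```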
However, this requires taking the scenic path of reading data from sources, performing CDC via connectors, staging the data in cloud storage like S3, reading it via Auto Loader, and then pushing it into the Bronze table.
With Grainite at the center, one can skip Auto Loader entirely and write data directly into the Bronze table of Delta Lake.

- With its included database, Grainite can store metadata about which files have been read, which files are new, and so on.
- Grainite can monitor the incoming data for any schema changes. With its built-in transformation capabilities, it can even adapt the schema to fit the destination format.
- With the ability to invoke action handlers, Grainite can trigger alerts about incoming files when necessary.
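As an aside on what "writing directly" can look like in practice: the open-source `deltalake` (delta-rs) Python package can append records to a Delta table without going through Auto Loader or a Spark cluster. The sketch below is a generic illustration under that assumption, not Grainite's implementation, and the path and schema are placeholders.

```python
import pyarrow as pa
from deltalake import write_deltalake

# A small batch of records to land in the Bronze table (placeholder schema).
batch = pa.table({
    "order_id": [1001, 1002],
    "status": ["created", "shipped"],
})

# Append directly to the Delta table; credentials for the object store are
# expected to come from the environment. The table is created on first write.
write_deltalake("s3://lakehouse/bronze/orders", batch, mode="append")
```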
Summary:
With its robust capabilities and powerful tools, Databricks is well-positioned to help organizations unlock the full value of their data. However, to make the most of these benefits, organizations need to approach data ingestion strategically. This involves understanding the challenges involved, planning the data ingestion process carefully, and investing in the necessary resources and skills.
Grainite is an ideal middleware for ingesting data from multiple sources into Databricks. While this blog is focused on Databricks, Grainite can push data to any destination such as databases, data warehouses, data lakes, and applications. Grainite is highly cost-effective and can drastically reduce resource and operation costs while enabling the fastest time to market compared to any alternative solution.