Venice

Derived Data Platform for Planet-Scale Workloads

Venice is a derived data storage platform, providing the following characteristics:

  1. High throughput asynchronous ingestion from batch and streaming sources (e.g. Hadoop and Samza).
  2. Low latency online reads via remote queries or in-process caching.
  3. Active-active replication between regions with CRDT-based conflict resolution.
  4. Multi-cluster support within each region with operator-driven cluster assignment.
  5. Multi-tenancy, horizontal scalability and elasticity within each cluster.

The above makes Venice particularly suitable as the stateful component backing a Feature Store, such as Feathr. AI applications feed the output of their ML training jobs into Venice and then query the data for use during online inference workloads.

Overview

Venice is a system which straddles the offline, nearline and online worlds, as illustrated below.

High Level Architecture Diagram

Dependency

You can add a dependency on Venice to any Java project as shown below. Note that Venice dependencies are not currently published on Maven Central, so an extra repository definition is required. All published jars can be seen here. The project is usually released a few times per week.

Gradle

Add the following to your build.gradle:

repositories {
  mavenCentral()
  maven {
    name 'VeniceJFrog'
    url 'https://linkedin.jfrog.io/artifactory/venice'
  }
}

dependencies {
  implementation 'com.linkedin.venice:venice-client:0.4.455'
}

Maven

Add the following to your pom.xml:

<project>
...
  <repositories>
    ...
    <repository>
      <id>venice-jfrog</id>
      <name>VeniceJFrog</name>
      <url>https://linkedin.jfrog.io/artifactory/venice</url>
    </repository>
  </repositories>
...
  <dependencies>
    ...
    <dependency>
      <groupId>com.linkedin.venice</groupId>
      <artifactId>venice-client</artifactId>
      <version>0.4.455</version>
      <scope>compile</scope>
    </dependency>
  </dependencies>
</project>

APIs

From the user's perspective, Venice provides a variety of read and write APIs. These are fully decoupled from one another, in the sense that no matter which write APIs are used, any of the read APIs are available.

Furthermore, Venice provides a rich spectrum of options, ranging from simplicity on one end to sophistication on the other. It is easy to get started with the simpler APIs and later enhance the use case with more advanced ones, either in addition to or instead of the simpler ones. In this way, Venice can accompany users as their requirements evolve in terms of scale, latency and functionality.

The following diagram presents these APIs and summarizes the components coming into play to make them work.

API Overview

Write Path

Venice supports flexible data ingestion:

  • Batch Push: Full dataset replacement from Hadoop or Spark (see the sketch after this list)
  • Incremental Push: Bulk additions without full replacement
  • Streaming Writes: Real-time updates via Apache Samza or the Online Producer
  • Write Compute: Partial updates and collection merging for efficiency
  • Hybrid Stores: Mix batch and streaming with configurable rewind time
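
As a rough, hedged sketch, the following Java snippet shows how a batch push might be triggered programmatically. It is illustrative only: the controller URL, store name and input path are placeholders, and the VenicePushJob entry point and property keys (venice.discover.urls, venice.store.name, input.path, key.field, value.field) follow the push job quickstart and should be verified against the documentation for your Venice release.

import java.util.Properties;

import com.linkedin.venice.hadoop.VenicePushJob;

public class BatchPushExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // All values below are placeholders for your own environment.
    props.put("venice.discover.urls", "http://venice-controller:5555");
    props.put("venice.store.name", "my-store");
    props.put("input.path", "/user/me/my-store-data"); // directory of Avro files on HDFS
    props.put("key.field", "key");     // field in the Avro records to use as the key
    props.put("value.field", "value"); // field in the Avro records to use as the value

    // Runs the push job: the dataset is written into a new store version,
    // which is swapped in once ingestion completes.
    new VenicePushJob("my-batch-push", props).run();
  }
}

In practice the push job is usually launched from a Hadoop or Spark scheduler, with the same properties supplied via a configuration file rather than hard-coded.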

Read Path

Venice provides multiple read APIs and client options:

Read APIs:

  • Single get, batch get
  • Read compute with server-side operations (dot product, cosine similarity, field projection)

Client Types:

  • Thin Client: Stateless, 2 network hops, < 10ms latency
  • Fast Client: Partition-aware, 1 network hop, < 2ms latency
  • Da Vinci Client: Stateful local cache, 0 network hops, < 1ms latency

All clients share the same APIs, enabling flexible cost/performance optimization without code changes.
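
As a minimal sketch of what reading through the Thin Client looks like, the snippet below issues a single get and a batch get using the venice-client artifact declared above. The store name and router URL are placeholders, and the exact factory and method signatures should be checked against the javadocs of your release.

import java.util.Map;
import java.util.Set;

import com.linkedin.venice.client.store.AvroGenericStoreClient;
import com.linkedin.venice.client.store.ClientConfig;
import com.linkedin.venice.client.store.ClientFactory;

public class ThinClientReadExample {
  public static void main(String[] args) throws Exception {
    // "my-store" and the router URL are placeholders for your own deployment.
    AvroGenericStoreClient<String, Object> client =
        ClientFactory.getAndStartGenericAvroClient(
            ClientConfig.defaultGenericClientConfig("my-store")
                .setVeniceURL("http://venice-router:7777"));
    try {
      // Single get: the returned future completes with the value, or null if the key is absent.
      Object value = client.get("some-key").get();
      System.out.println("some-key -> " + value);

      // Batch get: fetches several keys in a single request.
      Map<String, Object> values = client.batchGet(Set.of("key-1", "key-2")).get();
      values.forEach((k, v) -> System.out.println(k + " -> " + v));
    } finally {
      client.close();
    }
  }
}

Because the Fast Client and Da Vinci Client expose the same read interface, moving to either of them is primarily a matter of constructing a different client, not of rewriting call sites.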

Change Data Capture (CDC): Stream all data changes (inserts, updates, deletes) for use cases like ML feature retrieval and client-side indexing.


For a comprehensive guide to Venice's architecture, write modes, client characteristics, and capabilities, see the Architecture Overview.

Resources

The Open Sourcing Venice blog and conference talk are good starting points for an overview of the use cases and scale Venice can support. For more Venice posts, talks and podcasts, see our Learn More page.

Getting Started

Start with the Getting Started guide to learn Venice concepts and deploy your first cluster. The guide covers architecture fundamentals and provides quickstart instructions for both single and multi-datacenter deployments. We recommend sticking to our latest stable release.

Community

Feel free to engage with the community using our Slack workspace and GitHub.

Follow us on LinkedIn and Twitter to hear more about the progress of the Venice project and community.