DigestedProteinDB Engine

Enterprise-grade, high-performance peptide indexing and search engine


Overview

The Enterprise Engine (Rust) is an optimized implementation of DigestedProteinDB designed for large-scale proteomics and metaproteomics.

The Java version serves as the open-source research reference, while this engine targets fast mass-range queries (about 150 ms on average on the full UniProtKB/TrEMBL build, and under a millisecond for the narrow window in the sample run below) and embedding into high-performance computing (HPC) pipelines.

Key capabilities

Ultra-fast mass search

Retrieve candidate peptides within narrow precursor mass windows in milliseconds, even for very large databases.

Custom database generation

Build tailored peptide indexes for specific organisms, taxonomies, enzymes, and digestion parameters.
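
As a rough illustration of what the digestion parameters control, the sketch below enumerates tryptic peptides from one protein sequence. It is not the engine's implementation. It assumes cleavage after every K or R (the common "no cleavage before proline" rule is deliberately left out), an ASCII one-letter sequence, and the hypothetical function names `cleavage_fragments` and `digest`.

```rust
/// Split a protein into fragments that end at a tryptic site (after K or R).
/// Assumption: the "no cleavage before proline" rule is NOT applied here.
fn cleavage_fragments(protein: &str) -> Vec<&str> {
    let mut fragments = Vec::new();
    let mut start = 0;
    for (i, residue) in protein.bytes().enumerate() {
        if residue == b'K' || residue == b'R' {
            fragments.push(&protein[start..=i]);
            start = i + 1;
        }
    }
    // Keep the C-terminal remainder if the protein does not end in K or R.
    if start < protein.len() {
        fragments.push(&protein[start..]);
    }
    fragments
}

/// Enumerate peptides with up to `max_missed` missed cleavages,
/// keeping only those whose length lies in [min_len, max_len].
fn digest(protein: &str, max_missed: usize, min_len: usize, max_len: usize) -> Vec<String> {
    let fragments = cleavage_fragments(protein);
    let mut peptides = Vec::new();
    for first in 0..fragments.len() {
        let mut peptide = String::new();
        // Joining k + 1 consecutive fragments corresponds to k missed cleavages.
        for fragment in fragments.iter().skip(first).take(max_missed + 1) {
            peptide.push_str(fragment);
            if peptide.len() > max_len {
                break;
            }
            if peptide.len() >= min_len {
                peptides.push(peptide.clone());
            }
        }
    }
    peptides
}
```

Calling `digest(sequence, 2, 6, 50)` mirrors the parameters of the benchmarked build (up to 2 missed cleavages, length 6 to 50). The peptide AGASCPICKKEIQLVIK from the sample output contains two internal lysines, so under these assumptions it only appears when two missed cleavages are allowed.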

Out-of-core design

No need to load the full digest into memory. Works efficiently with very large UniProt-scale datasets.
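
The out-of-core idea can be sketched without RocksDB. The example below swaps in a much simpler stand-in: a flat file of sorted, fixed-width records that is searched with seeks, so only a few bytes are read per probe. The record layout (one big-endian `u64` mass key per record) and the names `RECORD_LEN`, `read_key`, and `lower_bound` are assumptions made for illustration and do not describe the engine's storage format.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical layout: each record is a single big-endian u64 mass key.
const RECORD_LEN: u64 = 8;

/// Read the key of record `index` directly from disk.
fn read_key(file: &mut File, index: u64) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    file.seek(SeekFrom::Start(index * RECORD_LEN))?;
    file.read_exact(&mut buf)?;
    Ok(u64::from_be_bytes(buf))
}

/// Index of the first record whose key is >= `target`.
/// Uses O(log n) seeks and never loads the whole file into memory.
fn lower_bound(file: &mut File, target: u64) -> io::Result<u64> {
    let mut lo = 0u64;
    let mut hi = file.metadata()?.len() / RECORD_LEN;
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if read_key(file, mid)? < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    Ok(lo)
}
```

A key-value store such as RocksDB adds compression, caching, and block indexes on top of this basic pattern, which is why this sketch should be read as a conceptual aid only.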

CLI-first architecture

Scriptable and pipeline-friendly command-line interface suitable for HPC and automated workflows.

Embeddable engine

Designed to be integrated into external software, services, or internal proteomics pipelines.

Enterprise customization

Supports custom builds, organism-specific databases, and performance tuning for production environments.

Performance snapshot

These results correspond to the largest database build we have benchmarked so far, which best reflects peak-scale performance. Additional snapshots for other database sizes will be added as they become available.

Enterprise Engine Specifications (Rust), high-performance build

Database version: UniProtKB/TrEMBL (Release 2026_01)
Taxonomy and scope: All organisms (proteome-wide), full global protein index
Scale: 202.55 million proteins / 17.20 billion peptides
Digestion parameters: Enzyme: trypsin; missed cleavages: up to 2; peptide length: 6 to 50 amino acids
Search performance: ~150 ms average mass-range query time
Disk footprint: ~255 GB (Snappy/Zstd compression and 5-bit encoding)
Memory usage: ~316 MB peak RAM (out-of-core indexing)
Storage engine: RocksDB key-value store (optimized Rust core)

Sample query session (bash):
user@proteomics-hpc:~$ massq 1800.0000 1800.0002 --db-path ./uniprot_trembl_db

#       Mass      Peptide             Accession   TaxID
1       1800.0001
                AGASCPICKKEIQLVIK   Q2HJ21      9913
                AGASCPICKKEIQLVIK   O15151      9606
                AVARMSVLSELCLPLAK   Q1WRS8      362948
                GGKGDLCIVLNVLLMQK   Q83NI2      218496
                GGKGDLCIVLNVLLMQK   Q83MZ4      203267
                GLMPLGITDEIRKMVK    A2RMH5      416870
                GLMPLGITDEIRKMVK    Q02XB8      272622
                ILMGASVGIPASSLCIIR  Q92275      5334
                KGQIVMTSDKPPKMLK    A0RLX8      360106
                KTMPLILSGVDVVAMAR   O49289      3702
                MIPMIVLATTNQNKVK    Q6AQD7      177439
                ... (1118 additional entries) ...

2       1800.0002
                VEEIYEDDEMNT        A0A1Y2ULM3  9606
                CHGWGGCHHIR         A0A8R8NSG9  10090
                ... (2394 additional entries) ...
... 
[SUCCESS] Found 239 peptides matching criteria.
Execution time: 469.129µs
user@proteomics-hpc:~$ _
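
The query above can be pictured as a range lookup over sorted, discretized masses. The sketch below is not the engine's code. It assumes masses are stored as integer keys scaled to four decimal places (in line with the "Mass Discretization (4 Decimal Places)" stage named in the architecture diagram) and that the index behaves like a sorted array. The names `mass_key`, `ppm_window`, and `range_query` are hypothetical.

```rust
/// Convert a mass in daltons to an integer key with 4 decimal places.
/// Assumed key layout; the engine's on-disk format is not documented here.
fn mass_key(mass: f64) -> u64 {
    (mass * 10_000.0).round() as u64
}

/// Symmetric ppm tolerance around a precursor mass,
/// returned as inclusive (low, high) integer keys.
fn ppm_window(mass: f64, ppm: f64) -> (u64, u64) {
    let delta = mass * ppm / 1_000_000.0;
    (mass_key(mass - delta), mass_key(mass + delta))
}

/// Return the entries whose key lies in [lo, hi].
/// `sorted` must be ordered by key; two binary searches find the bounds,
/// and the caller then scans the returned slice sequentially.
fn range_query<'a>(
    sorted: &'a [(u64, &'static str)],
    lo: u64,
    hi: u64,
) -> &'a [(u64, &'static str)] {
    if lo > hi {
        return &[];
    }
    let start = sorted.partition_point(|&(key, _)| key < lo);
    let end = sorted.partition_point(|&(key, _)| key <= hi);
    &sorted[start..end]
}
```

Under these assumptions, the window 1800.0000 to 1800.0002 maps to the keys 18000000 through 18000002, and a 10 ppm tolerance around 1800.0 Da corresponds to about ±0.018 Da.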

Architecture and workflow

The diagram is split into two parts: database construction (left) and mass-based search (right). The first shows the build pipeline from UniProtKB input through digestion and encoding into a RocksDB index. The second shows how experimental mass queries are matched against the indexed data to return candidate peptides.

High-level build and query flow for the enterprise engine.

Phase 1: Database Construction (Build Workflow)

  • Flow: UniProtKB Data, In Silico Digestion Engine, RocksDB Index
  • Pre-Build Configuration: NCBI Taxonomy Selection (TaxID / Division); Enzyme Selection (Trypsin/Chymotrypsin); Digestion Parameters (Length 6-50aa, MC=2)
  • Multi-Stage Optimization: Mass Discretization (4 Decimal Places); 5-bit Peptide Encoding; Base36 Accession Serialization

Phase 2: Mass-Based Search (Query Workflow)

  • Flow: Experimental Mass Input (Mass + Tolerance), Search Engine (Rust Core), Output: Peptides + Accessions + TaxIDs
  • Search method: Binary Search & Sequential Scan
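
Two of the optimization stages named in the diagram, 5-bit peptide encoding and Base36 accession serialization, can be illustrated with a small sketch. The bit layout and helper names below (`pack_5bit`, `unpack_5bit`, `accession_to_u64`, `u64_to_accession`) are assumptions and may differ from what the engine does. The sketch maps letters A to Z onto codes 1 to 26 (0 marks padding) and treats an accession as a base-36 number, which round-trips because UniProt accessions use only uppercase letters and digits and start with a letter.

```rust
/// Pack an uppercase peptide into 5 bits per residue (codes 1..=26, 0 = padding).
/// Returns None if the input contains anything other than A-Z.
fn pack_5bit(peptide: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity((peptide.len() * 5 + 7) / 8);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of pending bits in `acc`
    for b in peptide.bytes() {
        if !b.is_ascii_uppercase() {
            return None;
        }
        acc = (acc << 5) | (b - b'A' + 1) as u32;
        bits += 5;
        while bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1;
        }
    }
    if bits > 0 {
        // Left-align the remaining bits; the low bits stay zero as padding.
        out.push((acc << (8 - bits)) as u8);
    }
    Some(out)
}

/// Reverse of `pack_5bit`; stops at the first zero code (padding).
fn unpack_5bit(data: &[u8]) -> String {
    let mut out = String::new();
    let mut acc: u32 = 0;
    let mut bits: u32 = 0;
    for &byte in data {
        acc = (acc << 8) | byte as u32;
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            let code = ((acc >> bits) & 0x1F) as u8;
            acc &= (1 << bits) - 1;
            if code == 0 {
                return out;
            }
            out.push((b'A' + code - 1) as char);
        }
    }
    out
}

/// Interpret an accession such as "Q2HJ21" as a base-36 integer.
fn accession_to_u64(accession: &str) -> Option<u64> {
    u64::from_str_radix(accession, 36).ok()
}

/// Render a base-36 integer back into uppercase accession text.
fn u64_to_accession(mut n: u64) -> String {
    const DIGITS: &[u8; 36] = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if n == 0 {
        return "0".to_string();
    }
    let mut buf = Vec::new();
    while n > 0 {
        buf.push(DIGITS[(n % 36) as usize]);
        n /= 36;
    }
    buf.reverse();
    String::from_utf8(buf).expect("base-36 digits are ASCII")
}
```

With this layout, the 17-residue peptide AGASCPICKKEIQLVIK from the sample output occupies 11 bytes instead of 17, and a 10-character accession fits in a single 64-bit integer.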

Downloads

Downloadable builds and pre-built databases will be published as releases become available.

CLI Binary (Rust)

Optimized Rust binary for Linux, macOS, and Windows. Narrow mass-window queries can resolve in under a millisecond, and peak memory use stays low (~316 MB on the full UniProtKB/TrEMBL build).

Pre-release testing in progress.

Pre-built Databases

Download the complete UniProtKB/TrEMBL (Release 2026_01) digest (Trypsin, MC=2). Approximately 255 GB of optimized RocksDB data.

Direct high-speed mirror links will be provided.

Community / Research

  • Java open-source implementation
  • Academic and research use
  • Basic digestion and search
  • Evaluation and prototyping

Enterprise Engine (Rust)

  • High-performance optimized core
  • Low-overhead CLI execution
  • Designed for embedding
  • Custom database builds
  • Production- and HPC-ready

Typical use cases

  • Large-scale proteomics database preparation
  • Metaproteomics search space reduction
  • DIA/DDA candidate filtering by precursor mass
  • Core facility peptide indexing infrastructure
  • Backend engine for custom proteomics software

Enterprise access and collaboration

For custom database builds, pipeline integration, embedding support, or commercial licensing, please get in touch.

Contact