How to Leverage More Data Without Making More Decisions

translation

If you are working with Bible translation data, you know how challenging it can be to deal with different data sources, formats, and structures. You may have to spend a lot of time and effort modeling your data in a way that makes sense for your specific use case, whether that means building a graph, a relational database, or a document database. 😒 But what if there were a better way? What if you could focus on gathering and integrating more data, without having to make all the decisions up front about how to model it?

A simple database-like interface

That's the idea behind a simple database-like interface that we have been exploring recently. It is based on the principle of 🔥 "more data, and fewer decisions about data". ✨ The interface allows you to link together many kinds of wide-column datasets in various formats, such as JSON, XML, and plaintext, and even audio and image files. Every file can be a first-class citizen 🫡, and files may reference other files, or indexes within other files. ☝️ This way, you can create a rich and flexible data lake that can store and access any kind of data you need.
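
To make this concrete, here is a minimal sketch (in Python, with hypothetical file names and record shapes) of what a cross-file reference could look like: a record in one dataset points at a file and an index within that file, and a tiny helper resolves the reference.

```python
import json
from pathlib import Path

DATA_DIR = Path("data")  # the shared folder holding all the linked files

def load_dataset(file_name: str):
    """Load one dataset (a JSON file) from the shared data folder."""
    with open(DATA_DIR / file_name, encoding="utf-8") as f:
        return json.load(f)

def resolve(ref: dict):
    """Follow a {"file": ..., "index": ...} reference into another dataset."""
    return load_dataset(ref["file"])[ref["index"]]

# A record in one file can point at records in other files:
verse = {
    "id": "JHN.3.16",
    "text_file": "tok_pisin_verses.json",
    "audio_ref": {"file": "audio_segments.json", "index": 1042},
}

audio_segment = resolve(verse["audio_ref"])
```

Nothing here is prescribed; the point is only that any file in the folder can address any other file by name and position.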

The advantage of this approach is that you don't have to worry about how to schematize everything in advance. You can start with the raw data you have, and then build on top of it as you go. You can create graphs or relational databases on top of this NoSQL data, using queries that join datasets and build more complex, structured outputs. And the best part is that you can reuse these queries and store them forever, so that you don't have to repeat the same work over and over again. For example, you can create an alignment between two languages, and then include IDs or vectors that let you tunnel through to image data, audio Bible segments, or key paragraphs in commentary texts.
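
As a sketch of what such a reusable "query" might look like, here is a hypothetical Python join over two verse-keyed JSON files (the file names, keys, and shapes are assumptions, not existing tooling) that also carries along an ID tunneling through to audio segments:

```python
import json
from pathlib import Path

DATA_DIR = Path("data")

def load(name: str):
    with open(DATA_DIR / name, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical inputs: dictionaries keyed by verse id, e.g. {"JHN.3.16": "..."}
english = load("english_verses.json")
tok_pisin = load("tok_pisin_verses.json")
audio_index = load("audio_segments_by_verse.json")  # verse id -> segment id

# "Join" the two languages on verse id and keep a pointer to the audio segment.
alignment = [
    {
        "verse_id": verse_id,
        "english": english[verse_id],
        "tok_pisin": tok_pisin[verse_id],
        "audio_segment_id": audio_index.get(verse_id),
    }
    for verse_id in english
    if verse_id in tok_pisin
]
```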

Persisting joined data

How do you keep the joined or queried data around indefinitely? You just save it in a new JSON file and plunk it into your folder along with all the original data.
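
Under the same assumptions as the sketch above, persisting the result is just a json.dump into the shared folder, with a file name descriptive enough to stand on its own:

```python
import json
from pathlib import Path

DATA_DIR = Path("data")

# `alignment` is the joined structure from the previous sketch (a list of dicts).
alignment = [
    {"verse_id": "JHN.3.16", "english": "...", "tok_pisin": "...", "audio_segment_id": 1042},
]

out_path = DATA_DIR / "english_tok_pisin_alignment.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(alignment, f, ensure_ascii=False, indent=2)
```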

Saving the joined or queried data in a new JSON file alongside all the original data might seem straightforward, but it's worth considering the confusion this could cause in the long run. Just as saving different versions of a document under vague names can lead to confusion about which version is the most recent or correct, the same can happen with data files.

To avoid this, one might keep some simple metadata in a README file that maps the origins of the data. A simple solution could be to always include the script that created the data, with the script pulling its sources from a specific commit of another repository. This way, the "version" of the data can always be tracked. If a core repository is updated, all other repositories pinned to the old commit can regenerate their data. This process could potentially be automated as a GitHub Action that appends a list of added or updated files to the README (which would, in theory, include the creation script and its associated output files).
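
One way a creation script could record its own provenance is sketched below, assuming the source data lives in a sibling git repository (the paths and file names are hypothetical): the script asks git for the current commit of the source repo and writes it into a small metadata file next to the output.

```python
import json
import subprocess
from pathlib import Path

SOURCE_REPO = Path("../core-bible-data")            # hypothetical source repository
OUTPUT = Path("data/english_tok_pisin_alignment.json")

def source_commit(repo: Path) -> str:
    """Return the commit hash the source repository is currently checked out at."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

metadata = {
    "output": OUTPUT.name,
    "created_by": "scripts/build_alignment.py",      # the script that made the data
    "source_repo": str(SOURCE_REPO),
    "source_commit": source_commit(SOURCE_REPO),
}

with open(OUTPUT.with_suffix(".meta.json"), "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```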

These updates could be documented in the README, most recent first. This would provide a clear record of every new file added, an optional description, a timestamp for when the file was created or updated, and the commit hash it was generated from.
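
A minimal version of that README log, again with hypothetical file names and a plain append-to-the-end layout (a GitHub Action or a smarter script could instead insert entries newest-first under an "Updates" heading), might look like this:

```python
from datetime import datetime, timezone
from pathlib import Path

README = Path("README.md")

def log_update(file_name: str, commit: str, description: str = "") -> None:
    """Append one provenance entry: file name, timestamp, source commit, description."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"- `{file_name}` ({timestamp}, source commit `{commit}`): {description}\n"
    with open(README, "a", encoding="utf-8") as f:
        f.write(entry)

log_update(
    "english_tok_pisin_alignment.json",
    commit="a1b2c3d",  # placeholder commit hash
    description="Verse-level alignment regenerated from the core text repo.",
)
```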

Summary

This is not a new vision for a kind of database; it's probably just a NoSQL database. Maybe we could call it an Agile Data Hub. 🤔 Whatever you call it, the application to Bible translation data is important, since we're talking about many kinds of structured and unstructured data, especially in light of our ability to inexpensively extract structured data using LLMs. By using this simple database-like interface, you can speed up your delivery times and workflows, and leverage more data without making more decisions. Isn't that what we all want?

Note: This post was inspired by recent experiences trying to create an alignment POC for Tok Pisin, and a subsequent schema-brainstorming session with Ben Scholtens. Thanks, Ben!