Multimodal Language Server Protocol
#translation
The Multimodal Language Server Protocol (MLSP) is a protocol based on the Language Server Protocol (LSP) for communication between a language server and a client. It is designed to support the development of language servers that can process non-textual data.
Imagine being able to open an MP3 file in VS Code (or some other app) and have a server in the background identify, on the fly, all of the following (sketched in code after the list):
- the language(s) spoken in the audio
- characteristics of the speaker(s), their emotion, inflection, etc.
- the words spoken
- byte-based positions where you have spliced in edits
- milestones or transition markers for navigation
- etc.
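To make this concrete, here is a minimal sketch, in TypeScript (the language the LSP specification uses for its own type definitions), of the kind of notifications such a server might stream back while analyzing an MP3. Every method name, field, and value here is a hypothetical illustration, not part of any published spec.

```typescript
// Hypothetical JSON-RPC notifications a multimodal server might push
// while analyzing file:///talk.mp3. Ranges are byte offsets into the
// file; all method and field names are illustrative only.
const exampleNotification = {
  method: "mlsp/publishAnnotations",
  params: {
    uri: "file:///talk.mp3",
    annotations: [
      // language(s) detected across the whole file
      { range: { start: 0, end: 3_500_000 }, kind: "language", value: "en-US" },
      // speaker characteristics for the opening segment
      { range: { start: 0, end: 600_000 }, kind: "speaker", value: "speaker 1: calm, falling inflection" },
      // words spoken in a given byte range
      { range: { start: 44_100, end: 132_300 }, kind: "transcript", value: "Welcome, everyone." },
      // a position where an edit was spliced in
      { range: { start: 1_048_576, end: 1_048_576 }, kind: "edit-splice", value: "spliced edit boundary" },
      // a milestone marker for navigation
      { range: { start: 2_100_000, end: 2_100_000 }, kind: "milestone", value: "chapter 2" },
    ],
  },
};

console.log(JSON.stringify(exampleNotification, null, 2));
```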
And then, imagine if your language server could automatically clean out background noise on the fly, either for the whole file, or for byte-ranges you select in your editor using a waveform.
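That cleanup workflow might look like a request/response pair. Below is a rough sketch, again with hypothetical names (`mlsp/cleanAudio`, `ByteEdit`, and so on); binary payloads are base64-encoded because JSON-RPC is a text protocol.

```typescript
// Hypothetical request asking the server to denoise part of a file.
// All names are illustrative; byte offsets address the binary content.
interface ByteRange {
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

interface CleanAudioParams {
  uri: string;       // the document to process
  range?: ByteRange; // omit to process the whole file
}

// The server answers with byte-level edits, mirroring how an LSP
// server answers a formatting request with TextEdits.
interface ByteEdit {
  range: ByteRange;         // bytes to replace in the original file
  newContentBase64: string; // replacement audio, base64-encoded for JSON-RPC
}

interface CleanAudioResult {
  edits: ByteEdit[];
}

// Example: clean only the byte range selected in the waveform view.
const request: CleanAudioParams = {
  uri: "file:///talk.mp3",
  range: { start: 1_024_000, end: 4_096_000 },
};
```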
Key benefits
The benefits of an MLSP are the same as those of the LSP:
- pushing heavy processing out of the client and into a separate background process
- enabling the same server (whether local or remote) to be used by multiple clients
These twin benefits are especially important for multimodal language servers, which may require significant computational resources to process non-textual data. By breaking that heavy work out into a separate process, we keep the client responsive.
Major features
Like the Language Server Protocol, the MLSP would be designed to do two things (sketched in code after the list):
- Publish diagnostics (error, warning, info, hint levels) for a given byte position range in the file
- Update the content for a given byte position range
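Here is one way the diagnostics half might be typed, mirroring LSP's `textDocument/publishDiagnostics` but addressing byte ranges rather than line/character positions, and reusing the `ByteRange` shape from the earlier sketch. The severity numbering follows LSP's real `DiagnosticSeverity`; everything else is hypothetical.

```typescript
// Hypothetical byte-range diagnostics. Severity values reuse LSP's
// DiagnosticSeverity numbering: 1 = Error, 2 = Warning,
// 3 = Information, 4 = Hint.
type ByteDiagnosticSeverity = 1 | 2 | 3 | 4;

interface ByteRange {
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

interface ByteDiagnostic {
  range: ByteRange;
  severity: ByteDiagnosticSeverity;
  message: string; // e.g. "clipping detected" or "unintelligible segment"
}

// Analogue of LSP's PublishDiagnosticsParams, pushed server -> client.
interface PublishByteDiagnosticsParams {
  uri: string;
  diagnostics: ByteDiagnostic[];
}
```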
Separation of concerns is of the utmost importance. These diagnostics and content updates are completely agnostic to what you are trying to do with your multimodal language server. All that matters is which procedures the server can call remotely on the client, and which the client can call remotely on the server.
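In practice, both directions can ride on the same JSON-RPC plumbing the LSP libraries already provide. The sketch below uses the real `vscode-languageserver` npm package; the `mlsp/*` method names and parameter shapes are hypothetical.

```typescript
import { createConnection, ProposedFeatures } from "vscode-languageserver/node";

const connection = createConnection(ProposedFeatures.all);

// Client -> server: a procedure the client calls remotely, e.g. the
// hypothetical cleanup request from the sketch above.
connection.onRequest("mlsp/cleanAudio", async (params: unknown) => {
  // ... run the (expensive) denoising model here ...
  return { edits: [] }; // placeholder result
});

// Server -> client: a procedure the server calls remotely, pushing
// byte-range diagnostics as they are computed.
function publishFindings(uri: string): void {
  connection.sendNotification("mlsp/publishByteDiagnostics", {
    uri,
    diagnostics: [
      { range: { start: 0, end: 1_000 }, severity: 2, message: "low signal-to-noise ratio" },
    ],
  });
}

connection.listen();
```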
Adaptation of the LSP
The goal is to leverage the existing LSP as much as possible, extending (or disabling/ignoring) features only where absolutely necessary.
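One low-friction adaptation path: LSP's `ClientCapabilities` type already reserves an `experimental` field for exactly this kind of extension, so a client could advertise multimodal support during the standard initialize handshake without touching the base protocol. Everything nested under `multimodal` below is a made-up sketch of what an MLSP client might declare.

```typescript
// Hypothetical client capabilities sent during the LSP initialize
// handshake. The `experimental` property is a real part of LSP's
// ClientCapabilities; the fields under `multimodal` are illustrative.
const clientCapabilities = {
  experimental: {
    multimodal: {
      byteDiagnostics: true,              // can render byte-range diagnostics
      byteEdits: true,                    // can apply binary content updates
      media: ["audio/mpeg", "audio/wav"], // formats the client can open
    },
  },
};
```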
Conclusion
The Multimodal Language Server Protocol (MLSP) represents a significant step forward in the development of language servers. By extending the capabilities of the traditional Language Server Protocol to handle non-textual data, the MLSP opens up new possibilities for developers and users alike. Whether it’s processing spoken language in audio files, identifying speaker characteristics, or automatically cleaning background noise, the MLSP has the potential to change the way we interact with digital content. I’m excited to continue exploring and expanding the boundaries of what language servers can do.
Reach out to me at ryderwishart at gmail dot com if you’re interested in contributing to the project.