Multimodal Language Server Protocol
#translation
The Multimodal Language Server Protocol (MLSP) is a protocol based on the Language Server Protocol (LSP) for communication between a language server and a client. It is designed to support the development of language servers that can process non-textual data.
Imagine being able to open an MP3 file in VS Code (or some other app) and have a server in the background identify, on the fly, all of the following (sketched in code after the list):
- the language(s) spoken in the audio
- characteristics of the speaker(s), their emotion, inflection, etc.
- the words spoken
- byte-based positions where you have spliced in edits
- milestones or transition markers for navigation
- etc.
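To make this concrete, here is a minimal sketch, in TypeScript (the language the LSP specification uses for its own type definitions), of the kind of notifications such a server might stream back while analyzing an MP3. Every method name, field, and value here is a hypothetical illustration, not part of any published spec.

```typescript
// Hypothetical JSON-RPC notifications a multimodal server might push
// while analyzing file:///talk.mp3. Ranges are byte offsets into the
// file; all method and field names are illustrative only.
const exampleNotification = {
  method: "mlsp/publishAnnotations",
  params: {
    uri: "file:///talk.mp3",
    annotations: [
      // language(s) detected across the whole file
      { range: { start: 0, end: 3_500_000 }, kind: "language", value: "en-US" },
      // speaker characteristics for the opening segment
      { range: { start: 0, end: 600_000 }, kind: "speaker", value: "speaker 1: calm, falling inflection" },
      // words spoken in a given byte range
      { range: { start: 44_100, end: 132_300 }, kind: "transcript", value: "Welcome, everyone." },
      // a position where an edit was spliced in
      { range: { start: 1_048_576, end: 1_048_576 }, kind: "edit-splice", value: "spliced edit boundary" },
      // a milestone marker for navigation
      { range: { start: 2_100_000, end: 2_100_000 }, kind: "milestone", value: "chapter 2" },
    ],
  },
};

console.log(JSON.stringify(exampleNotification, null, 2));
```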
And then, imagine if your language server could automatically clean out background noise on the fly, either for the whole file, or for byte-ranges you select in your editor using a waveform.
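That cleanup workflow might look like a request/response pair. Below is a rough sketch, again with hypothetical names (`mlsp/cleanAudio`, `ByteEdit`, and so on); binary payloads are base64-encoded because JSON-RPC is a text protocol.

```typescript
// Hypothetical request asking the server to denoise part of a file.
// All names are illustrative; byte offsets address the binary content.
interface ByteRange {
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

interface CleanAudioParams {
  uri: string;       // the document to process
  range?: ByteRange; // omit to process the whole file
}

// The server answers with byte-level edits, mirroring how an LSP
// server answers a formatting request with TextEdits.
interface ByteEdit {
  range: ByteRange;         // bytes to replace in the original file
  newContentBase64: string; // replacement audio, base64-encoded for JSON-RPC
}

interface CleanAudioResult {
  edits: ByteEdit[];
}

// Example: clean only the byte range selected in the waveform view.
const request: CleanAudioParams = {
  uri: "file:///talk.mp3",
  range: { start: 1_024_000, end: 4_096_000 },
};
```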
Key benefits
The benefits of an MLSP are the same as those of the LSP:
- pushing heavy processing out of the client and into a separate background process
- enabling the same server (whether local or remote) to be used by multiple clients
These twin benefits are especially important for multimodal language servers, which may require significant computational resources to process non-textual data. By breaking that heavy work out into a separate process, we keep the client responsive.
Major features
Like the Language Server Protocol, the MLSP would be designed to do two things (sketched in code after the list):
- Publish diagnostics (error, warning, info, hint levels) for a given byte position range in the file
- Update the content for a given byte position range
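Here is one way the diagnostics half might be typed, mirroring LSP's `textDocument/publishDiagnostics` but addressing byte ranges rather than line/character positions, and reusing the `ByteRange` shape from the earlier sketch. The severity numbering follows LSP's real `DiagnosticSeverity`; everything else is hypothetical.

```typescript
// Hypothetical byte-range diagnostics. Severity values reuse LSP's
// DiagnosticSeverity numbering: 1 = Error, 2 = Warning,
// 3 = Information, 4 = Hint.
type ByteDiagnosticSeverity = 1 | 2 | 3 | 4;

interface ByteRange {
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

interface ByteDiagnostic {
  range: ByteRange;
  severity: ByteDiagnosticSeverity;
  message: string; // e.g. "clipping detected" or "unintelligible segment"
}

// Analogue of LSP's PublishDiagnosticsParams, pushed server -> client.
interface PublishByteDiagnosticsParams {
  uri: string;
  diagnostics: ByteDiagnostic[];
}
```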
Separation of concerns is of the utmost importance. These diagnostics and content updates are completely agnostic to what you are trying to do with your multimodal language server. All that matters is which procedures the server can call remotely on the client, and which the client can call remotely on the server.
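In practice, both directions can ride on the same JSON-RPC plumbing the LSP libraries already provide. The sketch below uses the real `vscode-languageserver` npm package; the `mlsp/*` method names and parameter shapes are hypothetical.

```typescript
import { createConnection, ProposedFeatures } from "vscode-languageserver/node";

const connection = createConnection(ProposedFeatures.all);

// Client -> server: a procedure the client calls remotely, e.g. the
// hypothetical cleanup request from the sketch above.
connection.onRequest("mlsp/cleanAudio", async (params: unknown) => {
  // ... run the (expensive) denoising model here ...
  return { edits: [] }; // placeholder result
});

// Server -> client: a procedure the server calls remotely, pushing
// byte-range diagnostics as they are computed.
function publishFindings(uri: string): void {
  connection.sendNotification("mlsp/publishByteDiagnostics", {
    uri,
    diagnostics: [
      { range: { start: 0, end: 1_000 }, severity: 2, message: "low signal-to-noise ratio" },
    ],
  });
}

connection.listen();
```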
Adaptation of the LSP
The goal is to leverage the existing LSP as much as possible, extending (or disabling/ignoring) features only where absolutely necessary.
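One low-friction adaptation path: LSP's `ClientCapabilities` type already reserves an `experimental` field for exactly this kind of extension, so a client could advertise multimodal support during the standard initialize handshake without touching the base protocol. Everything nested under `multimodal` below is a made-up sketch of what an MLSP client might declare.

```typescript
// Hypothetical client capabilities sent during the LSP initialize
// handshake. The `experimental` property is a real part of LSP's
// ClientCapabilities; the fields under `multimodal` are illustrative.
const clientCapabilities = {
  experimental: {
    multimodal: {
      byteDiagnostics: true,              // can render byte-range diagnostics
      byteEdits: true,                    // can apply binary content updates
      media: ["audio/mpeg", "audio/wav"], // formats the client can open
    },
  },
};
```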
Conclusion
The Multimodal Language Server Protocol (MLSP) represents a significant step forward in the development of language servers. By extending the capabilities of the traditional Language Server Protocol to handle non-textual data, the MLSP opens up new possibilities for developers and users alike. Whether it’s processing spoken language in audio files, identifying speaker characteristics, or automatically cleaning background noise, the MLSP has the potential to change the way we interact with digital content. I’m excited to continue exploring and expanding the boundaries of what language servers can do.
Reach out to me at ryderwishart at gmail dot com if you’re interested in contributing to the project.