Proposal for Universal Closed Caption Support in NDI

This page contains a technical proposal; it is not currently a published or formal standard. If you have comments or feedback on the content of this proposal, please contact us with your contribution.


The NDI IP video protocol SDK does not currently offer any guidance on carrying closed caption metadata. However, NDI has multiple real-time metadata mechanisms which could easily carry closed caption data. This document is a recommended practice technical note which attempts to standardise support for closed captions, so that each vendor does not produce a different implementation.


The existing proposal (RP-NDI-CC-708-2018) provides raw support for CEA-708 captions as a bitstream, which gives a simple method to capture captions from SDI, carry them within NDI, then restore the CEA-708 captions into SDI bitstreams. This standard may be expanded to carry other VANC data in raw bitstream format. Alongside this, there is a need to standardise an interpreted form of the caption data, so that every application does not have to decompose the bitstream individually (non-trivial with standards like CEA-708). This raises the opportunity to create a parallel metadata stream which is already interpreted and standardised into a universal set which could apply to all types of closed caption, regardless of their original format.


For DVB-T based captions (which are delivered as a bitmap in the video bitstream), this also provides a standardised text representation of those captions.


Applications include on screen display of captions, but also deeper workflows such as text analysis and metadata logging.  This standard can also be used by speech to text systems to deliver their output in a standardised and accessible format.


Basic Premise:


The objectives are to allow the following scenarios:

- Represent closed caption data in a universal format, regardless of source.

- The format should include simple messages with clear text, alongside control messages for on screen presentation which can easily be ignored by systems only interested in the text.

- Caption timing is implied by the real time nature of the NDI stream, just as it is in video bitstreams.


Initial development has been for the CEA-708 standard, which defines windows and a pen for multiple services (languages). The windows and pen can be styled to control position, color etc. The standard should accommodate all this information in a format which could ideally allow this metadata stream to be reconstructed into the CEA-708 bitstream without losing any information. That remains a goal of this standard as it develops, with the initial objective being to allow easier interpretation of the bitstream by metadata readers. To provide the most accessible, understandable and standardised syntax, basic HTML/CSS is used wherever possible within this standard. In this case the HTML span element carries the raw text, whilst the div element represents the caption window.


HTML Notes: An attempt is made to map cleanly between CEA-708 styling and CSS styles. In the case of window positioning, CEA-708 allows a window to be positioned at a percentage of the screen size, so all positions and sizes in this standard are expressed as percentages of the video screen size.
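As an illustration of the percentage-based mapping, the following minimal Python sketch builds a style string for a caption window. The function name `window_style` and its parameters are hypothetical and not part of this proposal; the property names follow the enhanced example later in this document, and the use of `visibility:hidden` for a hidden window is an assumption.

```python
def window_style(left, top, width, height, *, visible=True, priority=0, justify="left"):
    """Build a CSS style string for a caption window.

    Positions and sizes are percentages of the video frame.
    Hypothetical helper: names and defaults are illustrative only.
    """
    for name, value in (("left", left), ("top", top),
                        ("width", width), ("height", height)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    # Assumption: a hidden window is expressed as visibility:hidden.
    visibility = "visible" if visible else "hidden"
    return (
        f"width:{width}%;height:{height}%;top:{top}%;left:{left}%;"
        f"visibility:{visibility};z-index:{priority};text-align:{justify};"
    )


print(window_style(30, 93, 24, 6, priority=7))
```

Rejecting values outside 0 to 100 is a design choice of this sketch, made so that a malformed window is caught before a message is sent.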



This standard uses the real-time (non-frame-based) NDI metadata stream.


Simplest Example:


NDIlib_metadata_frame_t meta_data;

meta_data.p_data = "<CAPTION><span>Hello World!</span></CAPTION>";

This example carries a simple string of UTF-8 text with no styling and no target window, for the default (unspecified) service, which is assumed to be the primary language service.

This message implies that the text should be 'displayed' immediately, as relevant to the current video frames. This is the simplest form and may be used by tools such as metadata analysers to get straight to the raw text in complete sentences. Even when the format is expressed in a more complex form, metadata readers can simply search for the span tags to gather text. Note that the span tag may also be styled. This simplest format is suitable for delivering the output of continuous voice recognition systems, which typically have no context and no discrete sentences or punctuation.
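The "search for the span tags" approach can be sketched in a few lines of Python using the standard library XML parser. The function name `caption_text` is hypothetical, and the sketch assumes each message is well-formed XML and that span elements are not nested inside one another.

```python
import xml.etree.ElementTree as ET


def caption_text(message: str) -> str:
    """Return the plain text carried by the span elements of one CAPTION message."""
    root = ET.fromstring(message)
    # iter("span") finds spans at any depth, so div wrappers and CAPTION
    # attributes are skipped without being interpreted.
    # Assumption: spans are not nested, otherwise inner text would repeat.
    return "".join("".join(span.itertext()) for span in root.iter("span"))


simple = "<CAPTION><span>Hello World!</span></CAPTION>"
styled = (
    '<CAPTION service="1" action="create" standard="C708">'
    '<div id="0" style="width:24%;height:6%;top:93%;left:30%;">'
    "<span>Hello World!</span></div></CAPTION>"
)
print(caption_text(simple))
print(caption_text(styled))
```

Both messages yield the same text, which is the property a text-only reader such as a logger or analyser relies on.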



Enhanced Example:

<CAPTION service="1" action="create" standard="C708" >

<div id="0" style="width:24%;height:6%;top:93%;left:30%;visibility:visible;z-index:7;text-align:left;">

<span>Hello World!</span>

</div>

</CAPTION>



This example sends the same string to the screen, but specifies that the string is for caption service 1 (the primary service; 0 is not used) and window index 0, with the window positioned at 30% horizontally and 93% vertically on screen and sized at 24% x 6%. The window priority maps to z-index and justification maps to text-align. The visibility property maps to the CEA-708 flag which determines that this window should be immediately visible. The create action attribute means this is the creation of a new caption in window 0, which supersedes any previous state of this window. Other action attributes include "restyle", which allows the window style to be overridden (for example to hide it) without affecting the text. The "delete" action removes the window (similar to hiding it). Finally, the "append" action adds text to an existing window without otherwise affecting it.
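One way a receiver might track window state from these actions is sketched below in Python. This is illustrative only: the function name `apply_caption`, the dictionary layout, the defaults for missing attributes, and the exact shape of restyle, append and delete messages are all assumptions, since this proposal does not yet define them in detail.

```python
import xml.etree.ElementTree as ET


def apply_caption(windows: dict, message: str) -> None:
    """Update `windows`, keyed by (service, window id), from one CAPTION message."""
    root = ET.fromstring(message)
    service = root.get("service", "1")     # assumption: default is the primary service
    action = root.get("action", "create")  # assumption: default action is create
    for div in root.iter("div"):
        key = (service, div.get("id"))
        text = "".join("".join(s.itertext()) for s in div.iter("span"))
        if action == "create":
            # Supersedes any previous state of this window.
            windows[key] = {"style": div.get("style", ""), "text": text}
        elif action == "restyle" and key in windows:
            # Assumption: the new style replaces the old one; text is untouched.
            windows[key]["style"] = div.get("style", "")
        elif action == "append" and key in windows:
            windows[key]["text"] += text
        elif action == "delete":
            windows.pop(key, None)


windows = {}
apply_caption(windows, '<CAPTION service="1" action="create" standard="C708">'
                       '<div id="0" style="top:93%;left:30%;"><span>Hello</span></div></CAPTION>')
apply_caption(windows, '<CAPTION service="1" action="append">'
                       '<div id="0"><span> World!</span></div></CAPTION>')
text_after_append = windows[("1", "0")]["text"]
apply_caption(windows, '<CAPTION service="1" action="restyle">'
                       '<div id="0" style="visibility:hidden;"></div></CAPTION>')
style_after_restyle = windows[("1", "0")]["style"]
apply_caption(windows, '<CAPTION service="1" action="delete"><div id="0"></div></CAPTION>')
print(text_after_append, "|", style_after_restyle, "|", windows)
```

Keying the state on both service and window id keeps captions for different languages separate, even when they reuse the same window index.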







If you have any questions, or you would like to engage Sienna for NDI Consultancy or Custom Development, please contact info @ sienna.tv