Three views of time, part 1: SMIL

In a previous article, we mentioned that mainstream programming languages do not have any concept of time. This article and its two follow-ups will examine three languages (first a markup language, then two programming languages) that focus on timing and synchronization. In this first article, we will see how SMIL defines a powerful timing and synchronization model for presenting multimedia content. Next, we will see how the ChucK programming language provides a sample-accurate scheduler for the production of computer music, from digital signal processing algorithms to musical composition; and finally, we will look at the synchronous reactive programming model of the Esterel language.

SMIL is the Synchronized Multimedia Integration Language, an XML-based markup language specified by the W3C. The first version of the SMIL specification was published in 1998, at a time when both the multimedia and scripting capabilities of HTML were in their infancy. The latest version of the specification, SMIL 3.0, was published in 2008. SMIL did not enjoy wide adoption on the Web; the only significant part of the specification that is currently implemented in Web browsers is animation, through the animation elements of SVG. SMIL is used outside of the Web, however, for instance in MMS specifications, or in the DAISY and EPUB digital book formats.

SMIL is similar to HTML in that it provides a way to author multimedia presentations, but its focus is on the timing relationships between media elements (such as text, images, audio and video), whereas HTML focuses on the spatial relationships within its content (which was static at the time when SMIL was introduced). In place of a flow of paragraphs or lists, SMIL containers allow the presentation of content in sequence or in parallel; in place of width and height, the duration of the content can be controlled, either implicitly or explicitly.

Like HTML, SMIL is also declarative: everything that happens in a SMIL presentation is described through markup, without any scripting, JS or otherwise. HTML on its own has no timing and very few interactive abilities (mostly around forms); CSS adds animations, transitions and other features which, again, were not available when SMIL was created.

The timing and synchronization model of SMIL introduces mechanisms to describe the presentation of media over time. Fundamentally, this means describing when an element begins and ends. This depends on the duration of an element, which can be intrinsic (based on the element contents, like the natural length of a video file), or set to a given value (which can itself be derived from other elements beginning or ending, or events occurring).

In order to coordinate multiple elements, SMIL introduces three time containers:

seq, which plays its contents in sequence;
par, which plays its contents in parallel;
excl, which plays its contents one at a time, but without imposing any order.

The simple example below shows a fragment of SMIL describing the following scenario: play a video and its soundtrack together until they end, or until the user clicks on a button, whichever comes first; then show an image for 10 seconds.

<seq>
    <par endsync="first">
        <video src="film.mov"/>
        <audio src="sound.wav" dur="media" end="b.activateEvent"/>
    </par>
    <img src="done.png" dur="10s"/>
</seq>

A SMIL fragment

When the top-level seq begins, its first child element begins as well; when that child element ends, the following sibling begins, and so on until the last child ends, which ends the seq as well; its intrinsic duration is therefore the sum of the durations of its children. When the par element begins, all of its children begin at once, and the par ends when all children end; its intrinsic duration is the maximum duration of its children.
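The two default rules described above (a seq lasts as long as the sum of its children, a par as long as its longest child) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not an implementation of the SMIL timing model: the function names and the media lengths are hypothetical, and attributes that change the defaults (such as endsync) are deliberately ignored.

```python
def seq_duration(children):
    """A seq plays its children one after another, so their durations add up."""
    return sum(children)


def par_duration(children):
    """A par starts all of its children together and, by default,
    ends when the last of them ends."""
    return max(children, default=0.0)


# Hypothetical media lengths in seconds (not taken from the example above).
video_length = 95.0
audio_length = 120.0
image_duration = 10.0

total = seq_duration([par_duration([video_length, audio_length]), image_duration])
print(total)  # 130.0
```

With these made-up lengths, the par lasts 120 seconds (the longer of its two children) and the seq 130 seconds.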

Note however that the endsync="first" attribute of the par means that instead of waiting for all of its children to end, the par ends as soon as any one of its children ends. Additionally, the audio element specifies both a duration (dur="media") and a specific end time (end="b.activateEvent"), so that it ends either when its intrinsic duration (as specified by the special media value) elapses, or when the b (button) element is clicked, whichever comes first. In both cases, the audio element ending causes its parent to end (assuming that the video has not already ended), which in turn causes the sibling video element to end as well, so that the image can be shown next. And since an image has no intrinsic duration, a duration of 10 seconds is specified for it.
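To make the interplay of dur, end and endsync concrete, here is a small Python sketch that computes when the whole fragment ends. It is a simplified model under stated assumptions: times are in seconds from the beginning of the seq, the media lengths and click time are hypothetical inputs, and the function names are ours rather than anything defined by SMIL.

```python
from typing import Optional


def audio_end(audio_length: float, click_at: Optional[float]) -> float:
    """dur="media" combined with end="b.activateEvent":
    the audio ends at whichever of the two comes first."""
    if click_at is None:  # the button is never clicked
        return audio_length
    return min(audio_length, click_at)


def par_end_first(child_ends) -> float:
    """endsync="first": the par ends as soon as one child ends."""
    return min(child_ends)


def presentation_end(video_length: float, audio_length: float,
                     click_at: Optional[float] = None,
                     image_duration: float = 10.0) -> float:
    """End time of the seq: the par, followed by the image."""
    par_end = par_end_first([video_length, audio_end(audio_length, click_at)])
    return par_end + image_duration


print(presentation_end(120.0, 120.0))                 # 130.0
print(presentation_end(120.0, 120.0, click_at=30.0))  # 40.0
```

In the first call nothing interrupts playback, so the image appears after two minutes; in the second, a click at 30 seconds ends the audio, the par and the video, and the image is shown from 30 to 40 seconds.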

Reproducing this in HTML is left as an exercise to the reader. While the equivalent audio and video elements can do the heavy lifting of playing the media content, all of the synchronization needs to be scripted, which involves timers and tracking both user actions and the status of media playback. The resulting code will be harder to read and modify than its SMIL counterpart: the intent that the markup expresses directly has to be reconstructed from reading the code. Some changes that are trivial in SMIL also become much more involved when ad hoc code needs to be modified.

As an example, let’s say that the soundtrack to the movie actually contains a musical prelude; the audio is then longer, and if both audio and video start together, they will be out of sync. Delaying the start of the video only requires adding a begin attribute (see the figure below). To achieve the same effect in HTML, another timer needs to be introduced so that the movie begins at the right moment (making sure to cancel it if the audio ends early, of course).

<par endsync="first">
    <video begin="02:37" src="film.mov"/>
    <audio src="sound.wav" dur="media" end="b.activateEvent"/>
</par>

Adding a begin attribute to the video
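The values "02:37" and "10s" used in these fragments are SMIL clock values. As an illustration of how they translate to seconds, here is a minimal Python sketch of a parser for the common forms (hh:mm:ss, mm:ss, and a number with an optional h, min, s or ms unit, defaulting to seconds). The function name is hypothetical, and the sketch does not attempt to cover every detail of the grammar in the specification, such as range checks on minutes and seconds.

```python
import re

# Multipliers for timecount values such as "10s" or "5min".
_METRICS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}


def parse_clock_value(text: str) -> float:
    """Convert a SMIL-style clock value to a number of seconds."""
    text = text.strip()
    if ":" in text:
        # Full (hh:mm:ss) or partial (mm:ss) clock value.
        parts = text.split(":")
        if len(parts) not in (2, 3):
            raise ValueError(f"not a clock value: {text!r}")
        seconds = 0.0
        for part in parts:
            seconds = seconds * 60 + float(part)
        return seconds
    # Timecount value: a number with an optional unit.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(h|min|s|ms)?", text)
    if match is None:
        raise ValueError(f"not a clock value: {text!r}")
    number, metric = match.groups()
    return float(number) * _METRICS[metric or "s"]


print(parse_clock_value("02:37"))  # 157.0
print(parse_clock_value("10s"))    # 10.0
```

In the fragment above, the video therefore begins 157 seconds after its parent par, and a video of a given length ends 157 seconds plus that length after the par begins, unless the audio ends first.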

We have only scratched the surface of what SMIL offers in terms of timing and synchronization. We did not talk much about excl or about clipping, min and max timing, repetition, restarting, and so on. One feature of SMIL worth mentioning though is animation (as we saw above, it is used by SVG). SMIL provides the animate and set elements (among others) to change the value of an element attribute (such as its position, size, &c.) over time. What is interesting about animation is how it integrates with the timing model, so that an animation can be understood as another sort of media; instead of playing some media content, it modifies parts of the presentation. Importantly, this means that dur, begin, end, and other attributes can be specified, and animation elements can be children of par, seq and excl.
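Treating an animation as timed media can be illustrated with a short Python sketch: the animation has a begin time and a duration like any other element, and while it is active it yields a value for the attribute it targets. The sketch assumes linear interpolation between two numeric values and assumes that the effect is removed once the animation ends; the function and parameter names are hypothetical and do not mirror the attribute set of the animate element.

```python
from typing import Optional


def animated_value(from_value: float, to_value: float,
                   begin: float, dur: float, t: float) -> Optional[float]:
    """Value contributed by the animation at time t (in seconds),
    or None when the animation is not active."""
    if t < begin or t >= begin + dur:
        return None  # not begun yet, or already ended (effect removed)
    progress = (t - begin) / dur
    return from_value + (to_value - from_value) * progress


# Move a hypothetical x attribute from 0 to 100 over 10 seconds, starting at t = 5.
print(animated_value(0.0, 100.0, begin=5.0, dur=10.0, t=10.0))  # 50.0
print(animated_value(0.0, 100.0, begin=5.0, dur=10.0, t=2.0))   # None
```

Because the animation is described by the same begin and dur values as a media element, it can be scheduled inside a seq or par in exactly the same way.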

The SMIL specification is very large, but at its core is a timing and synchronization model that is relatively simple and very powerful. It gives us concepts to define complex timing scenarios that include interactivity (so the timing does not have to be completely fixed in advance and can be adjusted to user inputs), and provides rules for scheduling the playback and display of multimedia elements and animations. Unfortunately, it was never fully implemented in Web browsers, so this power is not available to us and leaves us with inferior tools and a lot of scripting to achieve the results that it could provide in a much more convenient way. ⚁⚃