Creating an API for Video Processing with NodeJS. Part 1: The Shape of the API.

In this article, we go through all the steps of developing a universal API for video processing

Vladimir Topolev
Numatic Ventures

--

Thank you, Alex Zharkov, for helping with the cover

Streaming servers are increasingly common, and various applications incorporate video processing features into their business model. A typical business requirement may look like this: add a branding logo or some text to the video, trim the original footage, change its dimensions, and so on.

This article outlines, step by step, the process of implementing a proof of concept of the video service and provides suggestions for future extensions. In this first part, we focus on developing the shape of the API and declaring all the necessary types using TypeScript. In the upcoming article, we will delve into the specific implementation details (link here).

In a previous article titled “How to Process Video with FFmpeg and NodeJS,” I explained how to create a video composition based on a specific design. However, that method is not a universal solution, as it requires a lot of extra development for each brand-new video composition. In this article, I will demonstrate how to develop video compositions in a generic way, where all manipulations are defined declaratively in JSON format. For instance, if you want to modify the size of the video, add a logo and some text, flip it horizontally, and trim its duration, you can define your intentions using this format:

{
  "composition": {
    "type": "video",
    "path": "path to video resource",
    "start": "00:00:05",
    "length": "00:00:10",
    "operations": [
      {
        "type": "crop",
        "width": 200,
        "height": 300
      },
      {
        "type": "horizontalFlip"
      },
      {
        "type": "scale",
        "width": 200,
        "height": -1
      }
    ],
    "overlays": [
      {
        "type": "image",
        "path": "path to logo resource"
      },
      {
        "type": "text",
        "text": "Hello",
        "x": 54,
        "y": 300
      }
    ]
  },
  "output": {
    "format": "mp4"
  }
}

Even if you are unfamiliar with this API, you can quickly understand how to create a new composition by reviewing multiple usage examples.

💡 API Generalization

Let's define the shape of our JSON API by asking and answering crucial questions and iteratively developing it step by step.

Any video can be created by combining various media assets. The first question is:

❓ What type of media assets are we going to handle?

At least three types come to mind: video, image, and text. Let’s reflect this in a new TypeScript type:

export enum AssetTypes {
  video = 'video',
  image = 'image',
  text = 'text',
}

This type will be used as a discriminant; in code, it determines how we should process a particular asset.
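For instance, here is a minimal, hypothetical sketch (the processAsset helper is not part of the final API) of how code can narrow on this field:

export function processAsset(asset: { type: AssetTypes }): string {
  // TypeScript narrows the type inside each case of the switch
  switch (asset.type) {
    case AssetTypes.video:
      return 'process as video';
    case AssetTypes.image:
      return 'process as image';
    case AssetTypes.text:
      return 'process as text';
  }
}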

For video and image we must also define a mandatory field containing a path to the media asset: a string specifying a regular file or a URL. Therefore, the type in TS may look like this:

export type VideoOrImageAssetBase = {
  type: AssetTypes.video | AssetTypes.image;
  path: string;
};

Probably, the next question should be:

❓ What types of operations would we like to apply to “video” / “image” assets?

It would be nice to have options for scaling, cropping, and horizontally flipping the media. Since we have decided to use the FFmpeg framework, we need to find the specific filters for these operations and, based on the documentation, define the parameters required to configure them in the API we are developing:
- scale (link to documentation)
- crop (link to documentation)
- horizontal flip (link to documentation)

👉 In general, it’s not a good idea to build a generic API around a particular framework, since the framework may potentially be replaced in the future, but it can be a good starting point.

Let’s define a generic type that describes an operation; it should have at least one mandatory field: the type of the operation. We collect all the operation types in an enum:

export enum AssetOperationTypes {
  crop = 'crop',
  scale = 'scale',
  horizontalFlip = 'horizontalFlip',
}

export type AssetOperationBase = {
  type: AssetOperationTypes;
};

Let’s define the specific fields for each operation based on the FFmpeg documentation, starting with the crop operation (link).

When learning about filter parameters from the documentation, it is important to first identify all the mandatory fields. Non-mandatory fields typically have default values. For example, the crop filter only requires the width and height properties, while all the other fields are optional. If these optional fields are not applicable to your project, you may choose to skip them and not include them in the API you are developing.

Therefore, the type for the crop operation has the following shape:

export type CropOperation = AssetOperationBase & {
  type: AssetOperationTypes.crop;
  width: string | number;
  height: string | number;
  x?: string | number;
  y?: string | number;
};

After reviewing the documentation, we can give the other operations (scale, horizontalFlip) the following shapes:

export type ScaleOperation = AssetOperationBase & {
  type: AssetOperationTypes.scale;
  width: string | number;
  height: string | number;
};

export type HorizontalFlipOperation = AssetOperationBase & {
  type: AssetOperationTypes.horizontalFlip;
};

Let’s also create an AssetOperation type that includes all the operation types defined above:

export type AssetOperation =
  | ScaleOperation
  | CropOperation
  | HorizontalFlipOperation;
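To make the mapping to FFmpeg concrete, here is a minimal sketch of how each operation could translate into an FFmpeg filter string (crop, scale, and hflip are the actual filter names; the function itself is only an illustration, not the implementation we build in part 2):

export function toFilterString(operation: AssetOperation): string {
  switch (operation.type) {
    case AssetOperationTypes.crop: {
      const { width, height, x, y } = operation;
      // x and y are optional; when omitted, FFmpeg centers the crop area
      return x !== undefined && y !== undefined
        ? `crop=${width}:${height}:${x}:${y}`
        : `crop=${width}:${height}`;
    }
    case AssetOperationTypes.scale:
      // a height of -1 tells FFmpeg to preserve the aspect ratio
      return `scale=${operation.width}:${operation.height}`;
    case AssetOperationTypes.horizontalFlip:
      return 'hflip';
  }
}

// toFilterString({ type: AssetOperationTypes.scale, width: 200, height: -1 })
// -> 'scale=200:-1'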

Since we may apply one or many operations to video/image assets, let’s add an operations field to VideoOrImageAssetBase (we already declared this type above):

export type VideoOrImageAssetBase = {
  type: AssetTypes.video | AssetTypes.image;
  path: string;
  operations?: AssetOperation[]; // <== new field
};

❓ How can we trim the video duration?

Let’s talk separately about the video media type. We should be able to cut a video from a particular starting timestamp with a particular duration; let’s reflect this in a new type:

export type VideoAsset = VideoOrImageAssetBase & {
  type: AssetTypes.video;
  start?: string;
  length?: string;
};
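As a hedged sketch of the idea (the exact argument layout is my assumption; the real implementation follows in the next article), start and length could map to FFmpeg’s -ss (seek) and -t (duration) options:

export function trimArgs(asset: VideoAsset): string[] {
  const args: string[] = [];
  if (asset.start) args.push('-ss', asset.start); // seek to the start timestamp
  args.push('-i', asset.path);
  if (asset.length) args.push('-t', asset.length); // limit the output duration
  return args;
}

// trimArgs({ type: AssetTypes.video, path: 'in.mp4', start: '00:00:05', length: '00:00:10' })
// -> ['-ss', '00:00:05', '-i', 'in.mp4', '-t', '00:00:10']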

The start and length fields do not make sense for image media assets; therefore, the TypeScript type for an image looks like this:

export type ImageAsset = VideoOrImageAssetBase & {
  type: AssetTypes.image;
};

❓ What is the “text” media type about?

Looking at the media types again, you may notice that we defined three types of assets, but the text type has not been mentioned so far. Well, a text media asset is useless by itself. But it makes sense when we would like to put it on top of a media asset of type image or video.

FFmpeg has a special drawtext filter (official documentation here). After studying its parameters, let’s define a new type for this kind of asset:

export type TextAsset = {
  type: AssetTypes.text;
  text: string;
  size?: number;
  color?: string;
  font?: string;
  fontfile?: string;
  box?: string;
  boxcolor?: string;
  boxborderw?: number;
};

As we defined above, text media assets are supposed to be used as overlays for a video or image. This means that we need a new type, OverlayPosition, describing the x / y offsets at which the text will be drawn within the “video” / “image” frame:

export type OverlayPosition = {
  x?: number | string;
  y?: number | string;
};

Also, we agreed above that a text media asset cannot exist by itself; it needs an image or video asset to be placed on. This means that we should enhance our VideoOrImageAssetBase type in this way:

export type OverlayAsset = TextAsset & OverlayPosition;

export type VideoOrImageAssetBase = {
  type: AssetTypes.video | AssetTypes.image;
  path: string;
  operations?: AssetOperation[];
  overlays?: OverlayAsset[]; // <== new field
};
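To make this tangible, here is a minimal sketch of my own showing how such a text overlay could be rendered into a drawtext filter string (only the parameters present in our types are handled; the rest fall back to FFmpeg’s defaults):

export function toDrawtextFilter(overlay: TextAsset & OverlayPosition): string {
  const parts = [`text='${overlay.text}'`];
  if (overlay.x !== undefined) parts.push(`x=${overlay.x}`);
  if (overlay.y !== undefined) parts.push(`y=${overlay.y}`);
  if (overlay.size !== undefined) parts.push(`fontsize=${overlay.size}`);
  if (overlay.color) parts.push(`fontcolor=${overlay.color}`);
  if (overlay.fontfile) parts.push(`fontfile=${overlay.fontfile}`);
  return `drawtext=${parts.join(':')}`;
}

// toDrawtextFilter({ type: AssetTypes.text, text: 'Hello', x: 54, y: 300 })
// -> "drawtext=text='Hello':x=54:y=300"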

Note that we may place as many text media assets as we would like; therefore, the optional overlays field is an array.

Regarding the overlays field, the next question that may come to your mind is:

❓ Is it possible to put other “video” / “image” assets, not only “text”, on top of a “video” / “image” media asset?

Yes, we can. FFmpeg has a special overlay filter (official documentation here), and therefore the OverlayAsset type may be extended in this way:

export type OverlayAsset =
  (TextAsset | VideoAsset | ImageAsset) & OverlayPosition;

Here, I would like to highlight one question: could a video/image that is used as an overlay have its own array of overlays? Theoretically, it’s possible to implement this using recursion. However, I think it is a redundant over-complication, since the same composition of overlays can be achieved with a single level of nesting.

Let’s examine a scene with three sources and explore how we can create it in two ways: using nested overlays, or one level of overlays with a flattened structure. The difference between them is which source the x and y coordinates of each overlay are calculated relative to.

Let’s reflect this restriction in the type by omitting the overlays property:

export type OverlayAsset = (
  | TextAsset
  | Omit<VideoAsset, 'overlays'>
  | Omit<ImageAsset, 'overlays'>
) &
  OverlayPosition;
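A quick illustrative snippet (hypothetical, just to show the restriction in action): attaching nested overlays to an overlay asset no longer compiles:

const logoOverlay: OverlayAsset = {
  type: AssetTypes.image,
  path: 'path to logo resource',
  x: 10,
  y: 10,
  // overlays: [], // <== compile error: the property was omitted from the type
};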

We have defined types for all the supported media items. As a final step, we need to create a new type that includes all of them and defines the input shape of the JSON:

export type VideoComposition = VideoAsset | ImageAsset;

export type InputVideoJSON = {
  composition: VideoComposition;
};

And probably the last question:

❓ Is it possible to define which format of output video file we would like to get, whether it’s MP4 or WEBM?

Let’s define a type for this setting and extend the input type, which we will now call VideoComposition:

export type OutputSettings = {
  format: 'mp4' | 'webm';
};

export type VideoComposition = {
  composition: VideoAsset | ImageAsset;
  output: OutputSettings;
};
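As a usage example, the JSON from the beginning of the article can now be expressed as a fully typed value:

const input: VideoComposition = {
  composition: {
    type: AssetTypes.video,
    path: 'path to video resource',
    start: '00:00:05',
    length: '00:00:10',
    operations: [
      { type: AssetOperationTypes.crop, width: 200, height: 300 },
      { type: AssetOperationTypes.horizontalFlip },
      { type: AssetOperationTypes.scale, width: 200, height: -1 },
    ],
    overlays: [
      { type: AssetTypes.image, path: 'path to logo resource' },
      { type: AssetTypes.text, text: 'Hello', x: 54, y: 300 },
    ],
  },
  output: { format: 'mp4' },
};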

I’m satisfied with the API for now.

The full file with TypeScript declarations looks like this:

export enum AssetTypes {
  video = 'video',
  image = 'image',
  text = 'text',
}

// Types of available assets
export type VideoOrImageAssetBase = {
  type: AssetTypes.video | AssetTypes.image;
  path: string;
  operations?: AssetOperation[];
  overlays?: OverlayAsset[];
};

export type VideoAsset = VideoOrImageAssetBase & {
  type: AssetTypes.video;
  start?: string;
  length?: string;
};

export type ImageAsset = VideoOrImageAssetBase & {
  type: AssetTypes.image;
};

export type TextAsset = {
  type: AssetTypes.text;
  text: string;
  size?: number;
  color?: string;
  font?: string;
  fontfile?: string;
  box?: string;
  boxcolor?: string;
  boxborderw?: number;
};

// Overlay Assets
export type OverlayPosition = {
  x?: number | string;
  y?: number | string;
};

export type OverlayAsset = (
  | TextAsset
  | Omit<VideoAsset, 'overlays'>
  | Omit<ImageAsset, 'overlays'>
) &
  OverlayPosition;

// Asset Operations
export enum AssetOperationTypes {
  crop = 'crop',
  scale = 'scale',
  horizontalFlip = 'horizontalFlip',
}

export type AssetOperationBase = {
  type: AssetOperationTypes;
};

export type CropOperation = AssetOperationBase & {
  type: AssetOperationTypes.crop;
  width: string | number;
  height: string | number;
  x?: string | number;
  y?: string | number;
};

export type ScaleOperation = AssetOperationBase & {
  type: AssetOperationTypes.scale;
  width: string | number;
  height: string | number;
};

export type HorizontalFlipOperation = AssetOperationBase & {
  type: AssetOperationTypes.horizontalFlip;
};

export type AssetOperation =
  | ScaleOperation
  | CropOperation
  | HorizontalFlipOperation;

export type OutputSettings = {
  format: 'mp4' | 'webm';
};

export type VideoComposition = {
  composition: VideoAsset | ImageAsset;
  output: OutputSettings;
};

💡 API Extensions

As you get closer to the FFmpeg API and discover new filters, you may extend the API we have already developed. Right now, we can only process a composition that contains a single asset. But we would probably like a way to concatenate several video compositions together with a smooth transition at the places where the videos are glued. FFmpeg has an amazing xfade filter with 30+ different types of transitions. Official documentation here.

Therefore, we may extend our API and introduce a new type of asset, for example merge:

export enum AssetTypes {
  video = 'video',
  image = 'image',
  text = 'text',
  merge = 'merge', // <== new type of asset
}

This type of asset consists of several other assets arranged in an array. To keep things simple, each of them should be a video asset, but you could also support an asset that generates a video from an image. Each asset in the array may define a transition type and duration. It may look like this:

// https://trac.ffmpeg.org/wiki/Xfade
export enum ClipTransitionEffect {
  fade = 'fade',
  fadeBlack = 'fadeblack',
  // other transition types skipped for brevity
}

export type MergeAsset = {
  type: AssetTypes.merge;
  chunks: Array<
    VideoAsset & {
      effect?: { type: ClipTransitionEffect; duration?: number };
    }
  >;
};
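For instance, two clips joined with a one-second fade could be described like this (the paths are placeholders; under the hood this would map to an xfade filter graph such as [0:v][1:v]xfade=transition=fade:duration=1:offset=N, where the offset is derived from the first chunk’s duration):

const merged: MergeAsset = {
  type: AssetTypes.merge,
  chunks: [
    {
      type: AssetTypes.video,
      path: 'path to first clip',
      // fade into the next chunk over one second
      effect: { type: ClipTransitionEffect.fade, duration: 1 },
    },
    {
      type: AssetTypes.video,
      path: 'path to second clip',
    },
  ],
};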

Don’t forget to extend the VideoComposition type with the newly declared asset type:

export type VideoComposition = {
  composition: VideoAsset | ImageAsset | MergeAsset;
  output: OutputSettings;
};

Conclusions:

In this article, we created the shape of an abstract API for video processing that is supposed to use FFmpeg under the hood. In the next article, we will implement it using NodeJS and the FFmpeg framework.

Sources:

  1. How to Process Video with FFmpeg and NodeJS
  2. How to Process Video with FFmpeg: Framework Syntax from Zero to Hero
  3. Top 17 FFmpeg Commands of Video Processing
  4. FFmpeg Official Site
  5. FFmpeg Filter Documentation
