Using Azure Cognitive Services to Caption Video

In this post we will explore the use of Microsoft Azure Cognitive Services to caption video images caught from a webcam in real time.

We will start by importing some libraries:

datetime for displaying the current time on our labels
requests to make requests to the Cognitive Services API
cv2 (OpenCV) for streaming images from our webcam

import datetime
import requests

import cv2 # opencv

You will need to sign up for an Azure Cognitive Services subscription API key.

subscription_key = # your subscription key here
vision_base_url = "https://westeurope.api.cognitive.microsoft.com/vision/v1.0/"
vision_analyze_url = vision_base_url + "analyze"

Now we define a function to make a request to Azure Cognitive Services and return a caption with the time of the request that we can use to caption our image.

def get_image_caption(image):
    img = cv2.imencode('.jpg', image)[1].tostring()
    # Request headers
    headers  = {'Content-Type': 'application/octet-stream',
                'Ocp-Apim-Subscription-Key': subscription_key }
    # Request params
    params   = {'visualFeatures': 'Categories,Description,Color'}
    # Make a request to the Azure Cognitive Services Vision Analysis API
    response = requests.post(vision_analyze_url,
          params=params,
          headers=headers,
          data=img)
    # Extract the json from the response
    analysis = response.json()
    # Extract the captions from the output json.
    image_caption = analysis["description"]["captions"][0]["text"].capitalize()
    time_now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Caption with current time
    return (time_now + ": " + image_caption)

We set a count and each time the count reaches 50, a request is sent to the Cognitive Services API. The caption from the API is overlayed on the webcam image.

Press the ‘q’ button to quit at any point.

stream = cv2.VideoCapture(0)
key = None
count = 0
overlay_text = ""

while True:
    if count % 3 != 0:
        # Read frames from live web cam stream
        (grabbed, frame) = stream.read()

        # Make copies of the frame for transparency processing
        output = frame.copy()
        
        # Get caption to overlay
        if count % 50 == 0:
            overlay_text = get_image_caption(output)
        
        # Overlay caption on image
        cv2.putText(output, overlay_text, (10, 30),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)

        # Show the frame
        cv2.imshow("Image", output)
        key = cv2.waitKey(1) & 0xFF
        
    count +=1
    # Press q to break out of the loop
    if key == ord("q"):
        break

# cleanup
stream.release()
cv2.waitKey(1)
cv2.destroyAllWindows()
cv2.waitKey(1)

-1

Below are some webcam screenshots with an idea of the kind of output you will see.

screen1
screen2 screen3