Chaitanya

| Technical Writer: Pooja Gramopadhye | Technical Review: ABCOM Team | Copy Editor: Anushka Devasthale | Level: Intermediate | Banner Image Source : Internet


Imagine you are an exhibitor displaying your precious collection of jewelry to the public in an open exhibit hall visited by hundreds. You want your visitors to have a close look at your collection, but at the same time, you need an assurance of security. The obvious way to ensure safety is to have a surveillance system in place, but such surveillance only helps in catching the thief after the theft. Wouldn't it be nice to have a system that alerts everyone the moment something goes wrong? Can you imagine that somebody has already developed a machine learning model for this task? Well, not precisely for catching thieves in the situation described, but certainly for recognizing many human activities in a live video stream. The model is trained to identify about 400 different human activities, such as washing hands, cooking pizza, and kicking. The complete list of activities that the model recognizes is given in the action_recognition_kinetics.txt file included in the project download. Using this model is trivial and requires only a few lines of Python code. I will now show you how to use it.

Project Setup

The project consists of a single Python file with just a few lines of code. Download the project source from our repository into the desired folder on your local drive. You will see the following folder structure.
folder
The only Python file that the project uses is recognize_human_activity.py. Download the resnet-34_kinetics.onnx model from here (source: original paper) and, once downloaded, drop it inside the project's model directory (shown in the directory structure). I have also kept a sample self-recorded video named myaction.mp4 in the test folder. The application will detect the actions in this video file and print the action names on the video. There are three actions recorded in the video.
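
In case the folder-structure screenshot above is hard to read, here is a sketch of the expected layout based on the paths referenced in the code (the root folder name is only a placeholder for wherever you extracted the project):

human-activity-recognition/
├── recognize_human_activity.py
├── model/
│   ├── action_recognition_kinetics.txt
│   └── resnet-34_kinetics.onnx
└── test/
    └── myaction.mp4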

Running the Application
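
Before running it, make sure OpenCV and NumPy are installed. If they are not, you can typically install them from PyPI (assuming the standard opencv-python and numpy package names):

pip install opencv-python numpy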

To run the application itself, use the following command:

python recognize_human_activity.py

When you run the above command, myaction.mp4 will start playing in a popup window. As the video plays, the detected actions will be displayed in the top-left corner of the window. Screenshots of the three activities are shown in the following figure.

action1
action2
action3

Trying Your Own Actions

You may like to try the application on your own video. In fact, you do not even need to record one: the application can detect actions in a real-time video stream captured from your webcam. Open the only source file, recognize_human_activity.py, and locate the following line:

        self.VIDEO_PATH = "test/myaction.mp4"

Change this line to the following:

        self.VIDEO_PATH = None

Save the file and rerun the program. Perform whatever actions you want in front of the webcam. Whenever the application detects an action known to it, the action name will be printed at the top of the window.

Now that you have successfully run the application and observed how it works, I will explain the code behind it.

How Does it Work?

As the source program is small, the full source is given below. The essential portions of the code are explained after the listing.

from collections import deque
import numpy as np
import cv2

# The Parameters class includes important paths and constants
class Parameters:
    def __init__(self):
        self.CLASSES = open("model/action_recognition_kinetics.txt"
                            ).read().strip().split("\n")
        self.ACTION_RESNET = 'model/resnet-34_kinetics.onnx'
#       self.VIDEO_PATH = None
        self.VIDEO_PATH = "test/myaction.mp4"
        # SAMPLE_DURATION is maximum deque size
        self.SAMPLE_DURATION = 16
        self.SAMPLE_SIZE = 112

# Initialize an instance of the Parameters class
param = Parameters()

# A double-ended queue to store the captured frames; as new frames
# arrive, the oldest frames pop out of the deque
captures = deque(maxlen=param.SAMPLE_DURATION)

# load the human activity recognition model
print("[INFO] loading human activity recognition model...")
net = cv2.dnn.readNet(model=param.ACTION_RESNET)

print("[INFO] accessing video stream...")
# Take a video file as input if given, else turn on the webcam
# So, the input is either an mp4 file or a live webcam stream
vs = cv2.VideoCapture(param.VIDEO_PATH if param.VIDEO_PATH else 0)

while True:
    # Loop over and read capture from the given video input
    (grabbed, capture) = vs.read()

    # break when no frame is grabbed (or at the end of the video)
    if not grabbed:
        print("[INFO] no capture read from stream - exiting")
        break

    # resize frame and append it to our deque
    capture = cv2.resize(capture, dsize=(550, 400))
    captures.append(capture)

    # Process further only when the deque is filled
    if len(captures) < param.SAMPLE_DURATION:
        continue

    # Now that our captures deque is filled, we can construct the
    # image blob. We use SAMPLE_SIZE as the height and width for
    # resizing the captured frames
    imageBlob = cv2.dnn.blobFromImages(captures, 1.0,
                                       (param.SAMPLE_SIZE,
                                        param.SAMPLE_SIZE),
                                       (114.7748, 107.7354, 99.4750),
                                       swapRB=True, crop=True)

    # Manipulate the image blob to make it fit as input for the
    # pre-trained human activity recognition model
    imageBlob = np.transpose(imageBlob, (1, 0, 2, 3))
    imageBlob = np.expand_dims(imageBlob, axis=0)

    # Forward pass through model to make prediction
    net.setInput(imageBlob)
    outputs = net.forward()
    # Index the maximum probability
    label = param.CLASSES[np.argmax(outputs)]

    # Show the predicted activity
    cv2.rectangle(capture, (0, 0), (300, 40), (255, 255, 255), -1)
    cv2.putText(capture, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
                0.8, (0, 0, 0), 2)

    # Display it on the screen
    cv2.imshow("Human Activity Recognition", capture)

    key = cv2.waitKey(1) & 0xFF
    # Press key 'q' to break the loop
    if key == ord("q"):
        break

The program first defines a few constants in a class called Parameters. You have already used one of these parameters, VIDEO_PATH, while running the application with a webcam. SAMPLE_DURATION determines the queue size, which is set to 16; this means that we capture 16 consecutive frames before processing them for action detection. SAMPLE_SIZE sets the image size used when resizing the captured frames.
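
As a quick sanity check (not part of the original listing), you can verify that the class file is in place and see roughly how many activities the model knows about; the path below assumes the same model directory shown earlier:

classes = open("model/action_recognition_kinetics.txt").read().strip().split("\n")
print(len(classes))   # roughly 400 activity labels
print(classes[:5])    # preview the first few label names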

The frames are captured in a double-ended queue of length decided by the SAMPLE_DURATION parameter.

captures = deque(maxlen=param.SAMPLE_DURATION)
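
If you have not used deque's maxlen argument before, this minimal, standalone illustration (not part of the project code) shows how the oldest items are discarded automatically once the queue is full:

from collections import deque

d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(d)   # deque([2, 3, 4], maxlen=3): the two oldest items were dropped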

We load the model by calling the readNet method of cv2.

net = cv2.dnn.readNet(model=param.ACTION_RESNET)

We capture the video by calling the VideoCapture method of cv2.

vs = cv2.VideoCapture(param.VIDEO_PATH if param.VIDEO_PATH else 0)

We set up an infinite loop for reading the stream.

while True:
    # Loop over and read capture from the given video input
    (grabbed, capture) = vs.read()

We resize the captured frame and append it to the previously set queue.

    # resize frame and append it to our deque
    capture = cv2.resize(capture, dsize=(550, 400))
    captures.append(capture)

After the queue is filled, we create image blobs from the captured frames:

    imageBlob = cv2.dnn.blobFromImages(captures, 1.0,
                                       (param.SAMPLE_SIZE,
                                        param.SAMPLE_SIZE),
                                       (114.7748, 107.7354, 99.4750),
                                       swapRB=True, crop=True)

Preprocess the image blob to the model’s required input format.

    imageBlob = np.transpose(imageBlob, (1, 0, 2, 3))
    imageBlob = np.expand_dims(imageBlob, axis=0)
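
To make the reshaping concrete: with SAMPLE_DURATION = 16 and SAMPLE_SIZE = 112, blobFromImages returns an array of shape (16, 3, 112, 112); the transpose and expand_dims convert it to (1, 3, 16, 112, 112), i.e. a single clip of 16 RGB frames, which is what the 3D ResNet expects. You can confirm this with an optional debugging line (not in the original listing):

    print(imageBlob.shape)   # expected: (1, 3, 16, 112, 112)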

Feed this processed blob to the model and forward pass through it for the predictions.

    net.setInput(imageBlob)
    outputs = net.forward()

Take the class with the maximum probability as the model's prediction:

    label = param.CLASSES[np.argmax(outputs)]
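
If you also want a rough confidence value to display alongside the label, one option (not part of the original code) is to apply a softmax to the raw outputs and take the highest probability:

    # Optional: convert the raw scores into probabilities with a softmax
    probs = np.exp(outputs - np.max(outputs))
    probs = probs / probs.sum()
    confidence = float(probs.max())
    label = "{} ({:.0%})".format(param.CLASSES[np.argmax(outputs)], confidence)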

Display the prediction in the top left corner of the window:

    cv2.rectangle(capture, (0, 0), (300, 40), (255, 255, 255), -1)
    cv2.putText(capture, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
                0.8, (0, 0, 0), 2)

    # Display it on the screen
    cv2.imshow("Human Activity Recognition", capture)

    key = cv2.waitKey(1) & 0xFF
    # Press key 'q' to break the loop
    if key == ord("q"):
        break

The application quits when the user presses the q key; note that I have not explicitly closed the webcam or the display window.
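
If you prefer a cleaner shutdown, you can release the capture device and close the display window once the loop ends; these two standard OpenCV calls can simply be appended at the end of the script:

vs.release()
cv2.destroyAllWindows()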

Conclusion

This trivial application has demonstrated the power of cv2 and pre-trained models for recognizing human actions. The model currently recognizes about 400 activities with roughly 78 to 95% accuracy. Such models can be trained further on specific human actions and may be useful for detecting theft in real time. You may find several other valuable applications, such as checking that people wash their hands regularly during the current COVID-19 pandemic, and so on.

Source: Download the project source from our Repository.
