| Technical Writer: Pooja Gramopadhye | Technical Review: ABCOM Team | Copy Editor: Anushka Devasthale | Level: Intermediate | Banner Image Source : Internet
As an exhibitor, you display your precious collection of jewelry to the public in an open exhibition hall visited by hundreds of people. You want your visitors to have a close look at your collection, but at the same time, you need an assurance of security. The obvious way to ensure safety is to have a surveillance system in place; however, such surveillance only helps in catching the thief after the theft. Wouldn't it be nice to have a system that alerts everyone while the wrongdoing is taking place? Can you imagine that somebody has already developed a machine learning model for this task? Well, not precisely for catching thefts as in the described situation, but certainly for recognizing many human actions in a live video stream. The model is trained to identify about 400 different human activities, such as washing hands, cooking pizza, and kicking. The complete list of activities that the model recognizes is given in the action_recognition_kinetics.txt file included in the project download. Using this model is trivial and requires only a few lines of Python code. I will now show you how to use it.
The project consists of a single Python file with just a few lines of code. Download the project source from our repository into a folder of your choice on your local drive. You will see the following folder structure.
The only Python file that the project uses is recognize_human_activity.py. Download the resnet-34_kinetics.onnx model from here (source: the original paper) and, once downloaded, drop it into the project's model directory (shown in the directory structure above). I have also kept a sample self-recorded video named myactions.mp4 in the test folder. The application will detect the actions in this video file and print the action names on the video. There are three actions recorded in the video.
Running the Application
To run the application, use the following command:
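Assuming you are in the project's root folder and Python is on your path, the invocation is simply (the command below is my assumption based on the script name given earlier; adjust `python` to `python3` if needed):

```shell
python recognize_human_activity.py
```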
When you run the above command, myactions.mp4 will start playing in a popup window. As the video plays, the detected actions will be displayed in the top-left corner of the window. Screenshots of the three activities are shown in the following figure.
Trying Your Own Actions
You may want to try the application on your own video. Actually, you need not record a video first: the application can detect actions in a real-time stream captured from your webcam. Open our only source file, recognize_human_activity.py, and locate the following line:
self.VIDEO_PATH = "test/myactions.mp4"
Change this line to the following:
self.VIDEO_PATH = None
Save the file and rerun the program. Perform whatever actions you want in front of the webcam. Whenever the application detects an action known to it, the action name is printed at the top of the window.
Now that you have successfully run the application and seen how it works, let me explain the code behind it.
How Does it Work?
As the source program is small, I am giving below the full source. I have explained the essential portions of the code after the source listing.
from collections import deque
import numpy as np
import cv2

# Parameters class includes important paths and constants
class Parameters:
    def __init__(self):
        self.CLASSES = open("model/action_recognition_kinetics.txt"
                            ).read().strip().split("\n")
        self.ACTION_RESNET = 'model/resnet-34_kinetics.onnx'
        # self.VIDEO_PATH = None
        self.VIDEO_PATH = "test/myactions.mp4"
        # SAMPLE_DURATION is the maximum deque size
        self.SAMPLE_DURATION = 16
        self.SAMPLE_SIZE = 112

# Initialise an instance of class Parameters
param = Parameters()

# A double-ended queue to store the captured frames; with time,
# old frames pop out of the deque
captures = deque(maxlen=param.SAMPLE_DURATION)

# load the human activity recognition model
print("[INFO] loading human activity recognition model...")
net = cv2.dnn.readNet(model=param.ACTION_RESNET)

print("[INFO] accessing video stream...")
# Take the video file as input if given, else turn on the webcam
# So, the input should be an mp4 file or a live webcam video
vs = cv2.VideoCapture(param.VIDEO_PATH if param.VIDEO_PATH else 0)

while True:
    # Loop over and read capture from the given video input
    (grabbed, capture) = vs.read()

    # break when no frame is grabbed (or the end of the video)
    if not grabbed:
        print("[INFO] no capture read from stream - exiting")
        break

    # resize frame and append it to our deque
    capture = cv2.resize(capture, dsize=(550, 400))
    captures.append(capture)

    # Process further only when the deque is filled
    if len(captures) < param.SAMPLE_DURATION:
        continue

    # Now that our captures array is filled, we can construct
    # our image blob. We use SAMPLE_SIZE as the height and
    # width for modifying the captured frame
    imageBlob = cv2.dnn.blobFromImages(captures, 1.0,
                                       (param.SAMPLE_SIZE,
                                        param.SAMPLE_SIZE),
                                       (114.7748, 107.7354, 99.4750),
                                       swapRB=True, crop=True)

    # Manipulate the image blob to make it fit as input for the
    # pre-trained OpenCV Human Action Recognition model
    imageBlob = np.transpose(imageBlob, (1, 0, 2, 3))
    imageBlob = np.expand_dims(imageBlob, axis=0)

    # Forward pass through the model to make a prediction
    net.setInput(imageBlob)
    outputs = net.forward()
    # Index of the maximum probability
    label = param.CLASSES[np.argmax(outputs)]

    # Show the predicted activity
    cv2.rectangle(capture, (0, 0), (300, 40), (255, 255, 255), -1)
    cv2.putText(capture, label, (10, 25),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8,
                (0, 0, 0), 2)

    # Display it on the screen
    cv2.imshow("Human Activity Recognition", capture)

    key = cv2.waitKey(1) & 0xFF
    # Press key 'q' to break the loop
    if key == ord("q"):
        break
The program first defines a few constants in a class called Parameters. You have already used one of these parameters, VIDEO_PATH, while running the application with a webcam. SAMPLE_DURATION determines the queue size, which is set to 16; this means that we capture 16 consecutive frames before processing them for action detection. SAMPLE_SIZE (112) is the height and width to which each frame is scaled when the input blob for the model is built.
The frames are captured in a double-ended queue whose length is set by the SAMPLE_DURATION parameter:
captures = deque(maxlen=param.SAMPLE_DURATION)
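The rolling-window behaviour that maxlen gives us can be seen in isolation. In this tiny standalone sketch the values are hypothetical integers rather than video frames, and the window size is 3 instead of the project's 16:

```python
from collections import deque

# A deque with maxlen discards its oldest entry once full,
# which is how the frame buffer keeps only the most recent
# SAMPLE_DURATION frames.
frames = deque(maxlen=3)
for i in range(5):
    frames.append(i)
print(list(frames))  # the three most recent items: [2, 3, 4]
```

Appending to a full bounded deque silently evicts the oldest element, so the buffer always holds the latest frames without any manual bookkeeping.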
We load the model by calling the readNet method of cv2.dnn:
net = cv2.dnn.readNet(model=param.ACTION_RESNET)
We capture the video by calling the VideoCapture method of cv2. If VIDEO_PATH is None, the webcam (device 0) is used instead:
vs = cv2.VideoCapture(param.VIDEO_PATH if param.VIDEO_PATH else 0)
We set up an infinite loop for reading the stream.
while True:
    # Loop over and read capture from the given video input
    (grabbed, capture) = vs.read()
We resize the captured frame and append it to the previously set queue.
# resize frame and append it to our deque
capture = cv2.resize(capture, dsize=(550, 400))
captures.append(capture)
After the queue is filled, we create image blobs from the captured frames:
imageBlob = cv2.dnn.blobFromImages(captures, 1.0,
                                   (param.SAMPLE_SIZE, param.SAMPLE_SIZE),
                                   (114.7748, 107.7354, 99.4750),
                                   swapRB=True, crop=True)
Preprocess the image blob into the model's required input format:
imageBlob = np.transpose(imageBlob, (1, 0, 2, 3))
imageBlob = np.expand_dims(imageBlob, axis=0)
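To make the shape manipulation concrete, the same two NumPy calls can be run on random data of the blob's shape. The shapes below assume SAMPLE_DURATION = 16 and SAMPLE_SIZE = 112 as in the project; cv2 itself is not needed for this sketch:

```python
import numpy as np

# blobFromImages on 16 frames yields shape (16, 3, 112, 112):
# (frames, channels, height, width). Simulate it with random data.
imageBlob = np.random.rand(16, 3, 112, 112).astype(np.float32)

# The model expects (batch, channels, frames, height, width),
# so swap the frame and channel axes, then add a batch dimension.
imageBlob = np.transpose(imageBlob, (1, 0, 2, 3))   # (3, 16, 112, 112)
imageBlob = np.expand_dims(imageBlob, axis=0)       # (1, 3, 16, 112, 112)
print(imageBlob.shape)
```

The transpose turns a stack of per-frame images into a single clip whose channels come first, and expand_dims wraps that clip in a batch of one.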
Feed the processed blob to the model and run a forward pass to get the predictions:
net.setInput(imageBlob)
outputs = net.forward()
Take the maximum probability as the model’s prediction:
label = param.CLASSES[np.argmax(outputs)]
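The indexing works because the network emits one score per class, and the class list was read from the text file in the same order. A miniature example with a hypothetical three-class list shows the idea:

```python
import numpy as np

# Hypothetical miniature example: three classes instead of 400.
CLASSES = ["washing hands", "cooking pizza", "kicking"]

# The network returns one score per class (shape (1, n_classes));
# np.argmax flattens the array and picks the index of the best score.
outputs = np.array([[0.1, 2.7, 0.4]])
label = CLASSES[np.argmax(outputs)]
print(label)  # cooking pizza
```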
Display the prediction in the top left corner of the window:
cv2.rectangle(capture, (0, 0), (300, 40), (255, 255, 255), -1)
cv2.putText(capture, label, (10, 25),
            cv2.FONT_HERSHEY_SIMPLEX, 0.8,
            (0, 0, 0), 2)

# Display it on the screen
cv2.imshow("Human Activity Recognition", capture)

key = cv2.waitKey(1) & 0xFF
# Press key 'q' to break the loop
if key == ord("q"):
    break
The application quits when the user presses the q key. Note that I have not explicitly released the webcam or closed the display window; adding calls to vs.release() and cv2.destroyAllWindows() before exiting is good practice.
This trivial application has demonstrated the power of cv2 and pre-trained models for recognizing human actions. The model currently recognizes about 400 activities with roughly 78 to 95% accuracy. Such models can be further trained on specific human actions and may be useful for detecting theft in real time. You may find several other valuable applications, such as checking that people wash their hands regularly during the current COVID-19 pandemic.
Source: Download the project source from our Repository.