GLP Depth Estimation

Monocular depth estimation infers the distance of objects in a scene from a single 2D image or video frame. The result describes the spatial layout of the environment and the relative separation between objects. On drones, it supports autonomous navigation, obstacle avoidance, and scene understanding: by extracting depth cues from the camera feed, the drone can perceive its surroundings, gauge the elevation of obstacles or landmarks, and make informed decisions for safe and efficient flight.

The method proposed in GLP (Global-Local Path Networks) is used for this purpose. Its encoder is implemented with transformers, which have a large receptive field for capturing global context. Its decoder captures local features to preserve structural details and produces a detailed feature map with the help of skip connections and a selective feature fusion module, which uses attention maps to pick out distinctive features. This approach achieved state-of-the-art results on the NYU Depth V2 dataset, with a root mean squared error of 0.344, while keeping the parameter count low.

An example of the model's output is shown below, where darker regions represent nearer objects and brighter regions represent farther ones.
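The root mean squared error quoted above can be illustrated with a small sketch. It is a minimal pure-Python version, assuming both depth maps are nested lists of per-pixel depths in the same unit; the function name rmse and the sample values are hypothetical, and real benchmark evaluation typically also masks out invalid pixels, which this sketch does not do.

```python
import math


def rmse(predicted, ground_truth):
    """Root mean squared error between two depth maps of equal shape.

    Both arguments are nested lists: rows of per-pixel depth values.
    """
    squared_errors = [
        (p - g) ** 2
        for pred_row, gt_row in zip(predicted, ground_truth)
        for p, g in zip(pred_row, gt_row)
    ]
    if not squared_errors:
        raise ValueError("depth maps must not be empty")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical 2x2 depth maps; only the bottom-right pixel differs (by 2.0).
predicted = [[1.0, 2.0], [3.0, 4.0]]
ground_truth = [[1.0, 2.0], [3.0, 6.0]]
error = rmse(predicted, ground_truth)
```

Because the metric squares each per-pixel error before averaging, a few large mistakes weigh more heavily than many small ones.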

Example

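from dronevis.models import DepthEstimator

# Create the estimator and load the default checkpoint from the Hugging Face hub
model = DepthEstimator()
model.load_model()
# Run depth estimation on the default webcam stream (video_index=0)
model.detect_webcam()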

GLP Depth Estimation Class

class dronevis.models.DepthEstimator

Depth estimation class built on Hugging Face Transformers

Source: https://huggingface.co/docs/transformers/tasks/monocular_depth_estimation

__init__(): Initialize self. See help(type(self)) for accurate signature.

load_model(model_name='vinvino02/glpn-nyu')

Load the model from the Hugging Face model hub

Parameters: model_name (str, optional) – Model name to load. Defaults to “vinvino02/glpn-nyu”.
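For orientation, loading the same checkpoint directly with the Hugging Face transformers pipeline API might look like the sketch below. This is an assumption about what load_model does, not a copy of the dronevis implementation; the helper name load_depth_pipeline and the file name frame.jpg are hypothetical.

```python
def load_depth_pipeline(model_name="vinvino02/glpn-nyu"):
    """Build a Hugging Face depth-estimation pipeline for the given checkpoint."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    return pipeline(task="depth-estimation", model=model_name)


# Usage (downloads the weights on first call, so it needs network access):
# from PIL import Image
# estimator = load_depth_pipeline()
# result = estimator(Image.open("frame.jpg"))  # "frame.jpg" is a placeholder path
# depth_image = result["depth"]                # PIL image of the depth map
# raw_depth = result["predicted_depth"]        # raw depth tensor
```

The pipeline handles resizing and normalization itself, which matches the note below that no separate input transformation is needed.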

transform_img(image)

Identity transformation for the input image, since the depth estimation model performs its own transformations internally during inference.

Parameters: image (Union[np.ndarray, PIL.Image]) – Input image
Returns: Same as the input image
Return type: np.ndarray

predict(image)

Run model inference on the provided image

Parameters: image (np.ndarray) – Input image for inference
Returns: Predicted depth map image.
Return type: np.ndarray
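How a raw depth map becomes a displayable image is not described here. One common approach, sketched below as an assumption, is min-max scaling to 8-bit gray values so that the nearest pixel is black and the farthest is white, consistent with the darker-is-nearer convention described above. The function name depth_to_gray and the sample values are hypothetical, and the sketch uses nested lists rather than arrays to stay dependency-free.

```python
def depth_to_gray(depth):
    """Scale a depth map (nested lists of floats) to 8-bit gray values.

    The nearest pixel maps to 0 (black) and the farthest to 255 (white).
    """
    flat = [d for row in depth for d in row]
    near, far = min(flat), max(flat)
    span = far - near
    if span == 0:
        # A constant depth map carries no contrast; return all black.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (d - near) / span) for d in row] for row in depth]


# Hypothetical 2x2 depth map: nearest pixel is 1.0, farthest is 6.0.
gray = depth_to_gray([[1.0, 2.0], [3.0, 6.0]])
```

Note that this scaling is relative to each frame, so the same gray value can correspond to different absolute distances in different frames.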

detect_webcam(video_index=0, window_name='Depth Estimation')

Run model on a video stream from the webcam

Parameters

video_index (int) – Index of the camera/video device to retrieve stream
window_name (str, optional) – Name of the OpenCV window for running the model. Defaults to "Depth Estimation".
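A webcam loop of this kind is usually built on OpenCV's VideoCapture. The sketch below shows one possible shape, assuming the per-frame predictor is passed in as a callable; the function name run_on_webcam and the quit key are hypothetical choices, not taken from the dronevis source.

```python
def run_on_webcam(predict, video_index=0, window_name="Depth Estimation"):
    """Show predict(frame) for each webcam frame until 'q' is pressed."""
    import cv2  # imported lazily; requires the opencv-python package

    capture = cv2.VideoCapture(video_index)
    if not capture.isOpened():
        raise RuntimeError(f"could not open video device {video_index}")
    try:
        while True:
            ok, frame = capture.read()  # frame is a BGR np.ndarray
            if not ok:
                break  # the device stopped delivering frames
            cv2.imshow(window_name, predict(frame))
            # waitKey also services the window's event loop
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        # Release the device and close the window even if predict raises.
        capture.release()
        cv2.destroyAllWindows()
```

With a loaded estimator, a call such as run_on_webcam(model.predict) would mirror the behaviour described for detect_webcam, under the assumptions stated above.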