Hacker-in-residence @voxel51. open source contributor. shipping fiftyone integrations. cvpr is better than neurips

Joined April 2020
1,026 Photos and videos
harpreet retweeted
Jun 13
Replying to @benjaminmmurphy
>be anthropic
>build the most powerful AI
>release it
>publish essay asking govt to block dangerous AI
>government blocks only YOU
10
174
35,460
your hand avatar model was trained on 10 subjects. it can reconstruct the hands it's seen. it can't generalize to a new person from a single photo
PALM is 13,000 high-quality 3D hand scans from 263 subjects with 90K calibrated multi-view images. diverse skin tones, ages, and geometry with physically based materials for relighting. no prior dataset has had real 3D scans, high-res multiview imagery, and subject diversity at the same time
grouped each subject's multi-view images with their 3D hand scan in fiftyone so you can explore the geometry and appearance variation across the full population
huggingface.co/datasets/Voxe…
3
28
2,525
harpreet retweeted
Today we're releasing Perceptron Agentic Detection: localize anything you can describe in natural language or show examples of.
2
13
44
10,846
harpreet retweeted
Level up your computer vision workflows with a free hands-on workshop for your team! Book a workshop: hubs.ly/Q04kQxh40
These hands-on workshops are delivered by Voxel51 computer vision experts. Both virtual and in-person formats.
* 60 min virtual workshop
* Half-day onsite workshop
* Full-day onsite workshop and hackathon
#mcp #skills #computervision #ai #artificialintelligence #machinevision #machinelearning #physicalai
1
1
59
this is what terrain classification looks like when there are no roads
STONE from ICRA 2026: 6 surround cameras, 128-channel lidar, and voxel-level traversability labels on off-road terrain. every point classified as free, traversable, potentially traversable, or non-traversable
grouped all 6 cameras and the 3D lidar with traversability coloring and ego trajectory in fiftyone #ICRA2026 #CVPR2026
huggingface.co/datasets/Voxe…
11
72
4,563
this is what 195° field of view looks like. your depth model was trained on 60°
WideDepth from ICRA 2026: millimeter-accurate depth ground truth for fisheye cameras across 101 indoor scenes. rendered from high-res lidar, not estimated
grouped fisheye, panoramic, and cropped views in fiftyone with depth heatmaps and 3D point clouds backprojected from the ground truth
huggingface.co/datasets/Voxe… #ICRA2026 #CVPR2026
9
149
8,126
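A note on why a 195° lens needs a different back-projection than a pinhole: under the pinhole model the image radius grows as f·tan(θ), which diverges at 90° off-axis, so rays beyond a 180° field of view cannot be represented. The sketch below uses the equidistant fisheye model (r = f·θ) purely as an assumption for illustration; the post does not say which camera model WideDepth uses, and the function name and parameters are hypothetical.

```python
import math

def equidistant_ray(u, v, cx, cy, f):
    """Unit viewing ray for pixel (u, v) under an ASSUMED equidistant
    fisheye model, where image radius r = f * theta (theta in radians).
    (cx, cy) is the principal point and f the focal length in pixels."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    if r == 0:
        return (0.0, 0.0, 1.0)  # pixel on the optical axis
    theta = r / f               # angle between the ray and the optical axis
    s = math.sin(theta) / r
    return (dx * s, dy * s, math.cos(theta))
```

With this model a pixel at 97.5° off-axis (half of 195°) maps to a ray with a negative z component, i.e. it points slightly behind the camera, which a pinhole projection cannot express. Multiplying the unit ray by a per-pixel ray distance would give a 3D point, assuming the depth is stored as distance along the ray.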
harpreet retweeted
wow, vggt-omega is dope. might just take best paper at cvpr again
1
3
9
367
harpreet retweeted
what a 3D reconstruction of a transparent object scene looks like when you back-project 121 frames of diffusion-estimated depth into a point cloud
built from TransPhy3D: each sequence has RGB video, depth, normals, and camera calibration, all grouped in FiftyOne #ICRA2026
2
3
6
1,122
'pick up the mug' is an object problem. 'pick up the mug by the handle' is a part problem. most 3D datasets solve the first one. almost none solve the second
PartScan from PinPoint3D: 1,509 scene-level 3D scans with dense per-point part segmentation across 707 scenes. no manual annotation, fully synthesized pipeline on real-world-style geometry
parsed it into fiftyone as interactive 3D point clouds. every point colored by its part label
huggingface.co/datasets/Voxe…
1
6
26
2,389
ADAS research is built on sedans on California highways. most road deaths happen on motorcycles in Mumbai
MOTOR is the first large-scale multimodal dataset for two-wheeler rider behavior. 1,629 sequences, 25 hours of video from 16 riders in dense unstructured Indian traffic. front camera, rear camera, helmet camera, rider eye-gaze from wearable trackers, on-road audio, and full telemetry (GPS, accelerometer, gyroscope). all synchronized
i grouped the three camera views together in fiftyone so you can see what the road looks like, what's behind the rider, and where the rider is looking, all at the same time
huggingface.co/datasets/Voxe…
3
16
1,454
my favorite part of all the #CVPR2026 posts is seeing the orange @voxel51 lanyards
1
1
6
179
a self-driving car doesn't see the world through one sensor. it sees it through nine cameras, seven lidars, and three radars simultaneously. the hard part isn't collecting the data. it's exploring all of it in sync
KITScenes Multimodal is a full robotaxi sensor suite captured in Frankfurt:
> 360° camera coverage
> fused lidar/radar point clouds
> Lanelet2 HD maps
> projected depth
> ego trajectory
> instance predictions
> all synchronized at 10 Hz
i grouped all 9 camera views and the fused 3D lidar point cloud together in fiftyone so you can flip between any camera angle, the lidar depth overlay, the HD map lanes, and the 3D scene for any frame
check it out here: huggingface.co/datasets/Voxe…
14
84
6,090
'count the dogs' seems like a simple task for a vision model. but what if there are 3 golden retrievers and 2 poodles? if you asked for 'golden retrievers' and the model returns 5, it can count but it can't follow your prompt
KubriCount redefines object counting as a multi-grained problem:
> identity
> attribute
> instance type
> category
> abstract concept
these are all different questions even when they look the same
their benchmark shows both MLLMs and specialist counting models fail badly at fine-grained distinctions. 11.7K images, per-pixel segmentation, bounding boxes, and the most comprehensive counting annotations to date
parsed it into fiftyone with detections and instance segmentation for every sample
if you're at #CVPR2026, visit booth 309 to get hands on with the dataset and learn about an interesting experiment i did using the latest LocateAnything model from nvidia
or if you're at home, you can get hands on here: huggingface.co/datasets/Voxe…
2
10
1,235
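The dog example from the KubriCount post can be made concrete with a small sketch: the same set of detections gives different correct counts depending on the granularity of the prompt. The detection records and field names below are hypothetical and chosen only for illustration; they are not the KubriCount annotation schema.

```python
# Hypothetical detections for the scene described in the post:
# 3 golden retrievers and 2 poodles. Field names are illustrative only.
detections = [
    {"category": "dog", "type": "golden retriever"},
    {"category": "dog", "type": "golden retriever"},
    {"category": "dog", "type": "golden retriever"},
    {"category": "dog", "type": "poodle"},
    {"category": "dog", "type": "poodle"},
]

def count_matching(dets, field, value):
    """Count detections whose `field` equals `value`."""
    return sum(1 for d in dets if d.get(field) == value)

category_count = count_matching(detections, "category", "dog")
type_count = count_matching(detections, "type", "golden retriever")
```

Here `category_count` is the answer to 'count the dogs' and `type_count` is the answer to 'count the golden retrievers'. A model that returns the category-level count for the type-level prompt has counted correctly but answered the wrong question.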
harpreet retweeted
EvoLogics uses Voxel51’s FiftyOne Enterprise to prepare the right training data and evaluate perception models for autonomous subsea missions, giving the team greater control and speed from data to production.
Learn more about how FiftyOne can optimize your computer vision workflows by booking a demo: voxel51.com/sales
Learn more about EvoLogics: evologics.com/
EvoLogics designs and manufactures underwater communication systems, positioning networks, and autonomous vehicles for the world's most demanding subsea environments. Based in Germany, the company has built technology for clients across commercial, offshore, defense, and research sectors. Operating at depths and in conditions unreachable by conventional means, EvoLogics systems locate survivors in search and rescue operations, inspect subsea pipelines and infrastructure, monitor ocean environments, and support naval mine countermeasure activities. Their systems process multimodal data in real time during live missions, including various types of sonar data and data from underwater and surface-mounted cameras, LiDARs, and hydrophones.
#computervision #ai #artificialintelligence #machinevision #machinelearning #datascience #physicalai #mcp #agents
2
2
130
static 3D reconstruction is mostly solved. dynamic scenes, where objects move and people walk around, that's still an open problem. the bottleneck is data: you need multiple synchronized cameras capturing the same moment from different angles with dense ground truth
Syn4D is a fully synthetic multiview dataset built for this. 8 synchronized cameras, Unreal Engine 5, per-frame depth maps, instance segmentation, camera poses, and natural language captions across offices, warehouses, and hospitals
i grouped the 8 camera views together in fiftyone with 3D point cloud reconstructions so you can flip between any camera angle, the depth and segmentation overlays, and the fused 3D scene for any sequence
check out the dataset here: huggingface.co/datasets/Voxe…
btw if you're at ICRA next week hmu or swing by booth B081 and say hi #ICRA2026
1
15
104
6,940
great work from @CodyJzr
3d point cloud reconstruction wasn't part of the original Syn4D dataset, but it was possible to reconstruct it from the ground-truth annotations that were included:
> Read per-frame depth (float32 EXR), RGB images, and per-frame camera intrinsics and extrinsics (focal length, sensor size, position, yaw/pitch/roll) from all 8 synchronised camera views
> Applied sRGB gamma correction to the linear-space RGB renders so colours display correctly
> Back-projected each valid depth pixel into a shared Unreal Engine world coordinate system using the standard pinhole camera model, converting the result from centimetres to metres
> Coloured each 3D point from its corresponding RGB pixel, merged all 8 views, then voxel-downsampled and removed statistical outliers to produce a clean cloud per sequence
1
2
400
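The back-projection step in the Syn4D thread can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the dataset: it assumes depth is stored as distance along the optical axis in centimetres, that the principal point sits at the image centre, and that the RGB render is linear in [0, 1]. All function names are hypothetical. It returns points in the camera frame; moving them into the Unreal Engine world frame depends on that engine's axis and rotation conventions, which are not covered here.

```python
import numpy as np

def intrinsics_from_lens(focal_mm, sensor_w_mm, sensor_h_mm, width, height):
    """Pinhole intrinsics in pixels from focal length and sensor size.
    ASSUMPTION: principal point at the image centre."""
    fx = focal_mm * width / sensor_w_mm
    fy = focal_mm * height / sensor_h_mm
    return fx, fy, width / 2.0, height / 2.0

def linear_to_srgb(rgb):
    """Standard sRGB transfer function for linear values in [0, 1]."""
    rgb = np.clip(rgb, 0.0, 1.0)
    return np.where(rgb <= 0.0031308,
                    12.92 * rgb,
                    1.055 * rgb ** (1 / 2.4) - 0.055)

def backproject(depth_cm, rgb_linear, fx, fy, cx, cy):
    """Back-project valid depth pixels to camera-frame points in metres.
    ASSUMPTION: depth_cm is z-depth along the optical axis, in centimetres."""
    h, w = depth_cm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth_cm) & (depth_cm > 0)   # drop empty/invalid pixels
    z = depth_cm[valid] / 100.0                      # centimetres -> metres
    x = (u[valid] - cx) / fx * z
    y = (v[valid] - cy) / fy * z
    points = np.stack([x, y, z], axis=1)             # (N, 3)
    colors = linear_to_srgb(rgb_linear[valid])       # (N, 3), one colour per point
    return points, colors
```

Voxel downsampling and statistical outlier removal, the last step in the thread, are typically done with a point cloud library after merging the views.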
robots can't grasp transparent objects because depth sensors can't see them. glass, bottles, clear containers just disappear
ClearDepth is a stereo dataset built for this problem: left/right video pairs with ground truth depth, surface normals, instance segmentation, and camera poses across 204 indoor scenes
i built point cloud reconstructions for each scene and grouped everything in fiftyone so you can flip between the stereo views, the dense labels, and the 3D reconstruction side by side #ICRA2026
1
18
78
5,627
the 3d reconstruction wasn't part of the original dataset but i was able to reconstruct it by:
> reading per-frame exr depth, rgb images, and camera poses from each scene
> back-projecting depth pixels into a shared world coordinate system
> coloring each 3d point from the RGB image and merging frames into one cloud
check out the dataset here: huggingface.co/datasets/Voxe…
if you're at icra this year hmu, or swing by booth B081 and say hi
1
5
217
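The "shared world coordinate system" and "merge frames into one cloud" steps can be sketched as follows. This is a minimal illustration, assuming each frame's pose is available as a 4x4 camera-to-world matrix and that the per-frame points are already back-projected into the camera frame; the post does not state how the ClearDepth poses are stored, and the function names are hypothetical.

```python
import numpy as np

def to_world(points_cam, cam_to_world):
    """Apply a 4x4 camera-to-world pose to (N, 3) camera-frame points.
    ASSUMPTION: the pose maps camera coordinates to world coordinates."""
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    return points_cam @ rotation.T + translation

def merge_frames(frames):
    """Merge per-frame clouds into one.
    frames: iterable of (points_cam, colors, cam_to_world) tuples."""
    all_points, all_colors = [], []
    for points_cam, colors, pose in frames:
        all_points.append(to_world(points_cam, pose))
        all_colors.append(colors)
    return np.concatenate(all_points), np.concatenate(all_colors)
```

If the dataset stores world-to-camera poses instead, the matrix would need to be inverted before use.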
harpreet retweeted
Level up your computer vision workflows with a free hands-on workshop for your team! Book a workshop: hubs.ly/Q04j2Pkl0
These hands-on workshops are delivered by Voxel51 computer vision experts. Both virtual and in-person formats.
* 60 min virtual workshop
* Half-day onsite workshop
* Full-day onsite workshop and hackathon
#mcp #skills #computervision #ai #artificialintelligence #machinevision #machinelearning #physicalai
1
1
93