Key takeaways from Phil Wong, Head of Capital Markets at SenseTime, during @HSBC's Private Bank Roundtable: China's #AI advantage today is increasingly defined by cost, but also quality of product, and in turn the ability to boost productivity and enhance efficiency for the end client, in order to maximise and optimise economic outcomes for end users. The real differentiator lies in creating measurable business outcomes at scale, in addition to just a cost-benefit. How SenseTime is putting this into practice:
• Multimodal Model #SenseNova U1 delivers strong performance with a smaller model footprint.
• AI tools are streamlining daily workflows, such as data analysis and PPT generation with Office #Raccoon, and video production powered by #Seko.
• AI infrastructure, #SenseCore, leverages compute-power co-optimization to reduce energy consumption and improve efficiency.
Beyond these, keep an eye on spatial intelligence, world models, and other emerging AI frontiers.
Through Publishing 3.0, we have applied our #MultimodalModel to help publishers in Hong Kong and the Chinese Mainland transform content into multilingual #eBooks and #audiobooks. This initiative supports publishers in reaching international markets and unlocks new opportunities for #IP commercialization. At a recent Sharing Session, Lewis Fung, Managing Director of SenseTime Hong Kong and Macau, outlined how we have leveraged #AI over the past year to streamline publishing workflows and improve #translation quality. He noted: "SenseTime is proud to support Publishing 3.0, which helps Hong Kong connect #culture, #technology, and global markets, strengthening its role as an international hub for IP trading and cultural exchange." Hong Kong is home to SenseTime's headquarters and its key R&D centre. We are committed to leveraging its internationalization advantages to empower industries to thrive.
Apr 29
SenseTime officially releases and open-sources its new SenseNova U1 series models. On April 25, 2026, SenseTime (商汤科技) officially launched and open-sourced the SenseNova U1 series, its latest and most powerful generation of large models. The U1 series features major advancements in:
- Multimodal understanding and generation (text, image, video)
- Complex reasoning and agentic capabilities
- Long-context processing
- High-efficiency performance
SenseTime stated that the models are being made openly available to promote broader AI innovation and allow developers and researchers worldwide to build upon them. The release strengthens SenseTime's position in the highly competitive Chinese large model landscape and adds another strong open-source option alongside models from DeepSeek, Alibaba, Moonshot, and Zhipu AI. #SenseTime #SenseNovaU1 #OpenSourceAI #商汤科技 #MultimodalModel #ChinaAI #LargeModelRelease #AIagent #SenseNova
#Term: #MultiModalModel "A multi-modal model is a system capable of processing, understanding and generating information across multiple types of data - known as 'modalities' (such as text, images, audio, video, and sensory data) - simultaneousl... with.ga/ld3a7
🚨 Call for Papers – CVPR 2026 "World Models" Workshop @CVPR
We are excited to announce the Call for Papers for our CVPR 2026 Workshop on "World Models Meet Active Sensing and Closed-Loop Planning".
🔗 beckschen.github.io/cvpr26wm…
📍 Location: CVPR 2026, Denver, USA
📅 June 3-4, 2026
This workshop aims to bring together researchers from computer vision, robotics, and embodied AI to explore new frontiers in world modeling.
Invited Speakers: @YAloimonos @chelseabfinn C. Karen Liu, Jitendra Malik, @NickRoy_MIT
📌 Topics include (but are not limited to):
• World Models
• Active Sensing
• Embodied Planning
• Robotics
✨ Lambda @LambdaAPI will sponsor the awards, including one Best Paper Award ($3,000 in compute credits), two Runner-Up Awards ($1,500 in compute credits each), and $400 in compute credits for each accepted paper.
Powered by an amazing organizing team 💥 @jieneng_chen @tianminshu @du_yilun Sanjeev Khudanpur, Cheng Peng, Rama Chellappa, @_Chen_Wei_ , Alan Yuille
#CVPR2026 #WorldModel #ClosedLoopPlanning #Agents #EmbodiedAI #Robotics #ActiveSensing #VLA #MultimodalModel
If you're at #WACV2026, come visit our CVP poster!
📄 arxiv.org/pdf/2512.08135
[Poster session]
🗓️ Sun, Mar 8, 2026 • 4:00 PM – 5:45 PM MST
📍 Tucson Ballroom & Prefunction Space 84
Our authors will be there and are happy to chat about spatial reasoning, multimodal models, and vision-inspired architectures. 👋
@wacv_official @mlpcucsd @LambdaAPI #spatialReasoning #MultimodalModel #VLM #3dvision
🚀 Excited to share our #WACV2026 paper for 3D spatial reasoning: CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
Inspired by human vision, we introduce CVP, which combines:
- Target-affinity tokens (central vision) to focus on relevant objects
- Allocentric grids (peripheral vision) to capture global scene context
This simple idea significantly improves 3D spatial reasoning, achieving SOTA performance across multiple benchmarks.
📄 Paper: arxiv.org/pdf/2512.08135
Page: zeyuan-chen.com/cvp/
#spatialReasoning #MultimodalModel #VLM @LambdaAPI @UCSD @mlpcucsd @wacv_official
Bro has no clue and speaks up anyway. Probably a recurring pattern with him. GPT is a multimodal model and can very well perform image analysis.
🌟 Multimodal AI is accelerating the road to Level 5 in vehicle autonomy 🚗 Read more: na2.hubs.ly/y0-Sb90 #LTSGDS #MultimodalAI #Autonomousdriving #AVsystem #Multimodalmodel
A general-purpose bipedal humanoid robot that handles everything from entertainment to cooking, reception, and customer service youtu.be/9dvygD4G93c #bipedal #humanoid #robot #GeneralPurposeRobot #DeepReinforcementLearning #ImitationLearning #VLM #MultimodalModel #AgiBot
26 Aug 2025
Metaverse vs. Artificial General Intelligence - Chamath Palihapitiya and Lex Fridman #agi #multimodalmodel #humanintelligence
🚀 Step 3 is now open source! @StepFun_ai officially releases its next-gen multimodal reasoning model to the world, with major breakthroughs in performance and efficiency.
💬 How is the tech community reacting? Check out the discussion on Zhihu: zhihu.com/question/193239477…
💡 Co-founder Yibo Zhu also shared an in-depth breakdown of the system design earlier: zhuanlan.zhihu.com/p/1932920…
#Step3 #OpenSource #MultimodalModel

31 Jul 2025
🚀 Announcing Step 3: Our latest open-source multimodal reasoning model is here! Get ready for a stronger, faster, & more cost-effective VLM!
🔵 321B parameters (38B active), optimized for top-tier performance & cost-effective decoding.
🔵 Revolutionary Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) enable efficient inference, even on modest GPUs.
🔵 Trained on 20T tokens (incl. 4T multimodal), with meticulous data curation ensuring reduced hallucinations & robust reasoning across vision and language.
🚄 Unmatched speed: Up to 4,039 tokens/sec/GPU, 70% faster than DeepSeek-V3 under similar conditions.
💎 Step 3 sets a new Pareto frontier, bridging power, efficiency, and practicality.
👉 Start building with Step 3 today: huggingface.co/stepfun-ai/st…
👉 More details on our research blog: stepfun.com/research/zh/step…
17 Jun 2025
Agree or disagree with @OfficialLoganK's POV? Share your thoughts. I fall more on the agree side, but not fully. #AGI #MultiModalModel #LLMs #AI #ML #ArtificialIntelligence #Memory #ContextManagement #FutureOfAI #WorldModel
A bipedal humanoid robot capable of natural interaction with humans. It can also ride a bicycle or a hoverboard. youtu.be/iyCjevFGLiA #bipedal #humanoid #robot #GeneralPurposeRobot #DeepReinforcementLearning #ImitationLearning #VLM #MultimodalModel #AgiBot #LingxiX2
When moving from just producing and transcoding video (and other modalities) to training a model, you need a well-defined data layout, a preprocessing pipeline, and a training loop that streams data through the GPU efficiently, without excessive memory transfers. Below is a conceptual, end-to-end approach that integrates all these concepts:

1. Data Organization and Labeling

A common approach for supervised training is to organize your dataset into a directory structure that encodes labels in folder names. Assume you have a dataset directory with train, val, and test splits, and each split contains subdirectories for each class label. Since you have multiple modalities (camera video, sound, LiDAR, radar, and even unknown sensors), store them in a systematic manner per sample:

    dataset/
      train/
        classA/
          sample_000/
            video.mp4
            audio.wav
            lidar.bin
            radar.bin
            sensorX.data
          sample_001/
            video.mp4
            audio.wav
            lidar.bin
            radar.bin
            sensorX.data
          ...
        classB/
          ...
      val/
        ...
      test/
        ...

Rationale:
- Each sample's modalities are grouped together in a single folder.
- Class labels come from the parent folder (e.g., classA).
- You can add a metadata file (e.g., metadata.json) to store timestamps, frame rates, or calibration data for LiDAR/radar if needed.

2. Preprocessing and Synchronization

Before training, data often needs preprocessing. You might need to:
- Decode and preprocess video frames using FFmpeg with GPU acceleration.
- Extract or transform audio into spectrograms.
- Convert LiDAR point clouds into a structured tensor (such as a voxel grid or a depth/image-like representation).
- Represent radar data similarly (e.g., as a heatmap, or as text-based messages turned into a small image or embedding).
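As a rough illustration of the audio item in the list above, the following NumPy sketch turns a mono waveform into a log-magnitude spectrogram. The function name, the 400-sample frame, and the 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sample rate) are illustrative choices, not values from the thread, and the mel filterbank that a log-mel pipeline would apply is deliberately left out.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-6):
    """Return a (num_frames, frame_len // 2 + 1) log-magnitude spectrogram.

    Assumes `signal` is a 1-D float array at least `frame_len` samples long.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Real FFT per frame; keep only the magnitude.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # eps avoids log(0) on silent frames.
    return np.log(magnitude + eps).astype(np.float32)
```

The resulting array can be written with np.save once, offline, so that the training loop only has to load a small .npy file per sample.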
GPU Memory and Data Transfers: To minimize CPU-GPU round trips, consider these steps:

Video Preprocessing: Use a GPU-accelerated FFmpeg command:

    ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input_video.mp4 \
      -vf "scale_cuda=640:480:format=yuv420p,hwdownload,format=yuv420p" \
      -c:v rawvideo -f rawvideo pipe:1

This decodes and scales on the GPU. With -hwaccel_output_format cuda the decoded frames are already in GPU memory, so no hwupload_cuda step is needed, and because raw frames written to a pipe have to live in system memory, the filter chain ends with hwdownload. If you must store frames for training, ideally keep them in a GPU-friendly compressed format (like a smaller-size H.264 file or a sequence of images in NV12 format).

Audio to GPU: Audio doesn't decode to GPU memory as easily, since audio decoding is not GPU-accelerated by default. Convert audio to a log-mel spectrogram or another feature offline. Store it as a NumPy .npy file (CPU memory). During training, you can load it and optionally upload it to the GPU.

LiDAR/Radar/Unknown Sensors: Convert these sensor modalities into 2D/3D tensors. For example, LiDAR point clouds can be rasterized into a bird's-eye-view image. Radar data can be turned into a range-Doppler map image. Perform these conversions offline, or on the fly during training with efficient CPU/GPU augmentation libraries. If the inputs are large, consider tiling or streaming them in smaller chunks and reassembling only what's needed.

3. Dataset Preparation for Training

Once the preprocessing is done, you might have:
- Compressed or frame-extracted video data stored in a GPU-friendly codec or as preprocessed tensors on disk.
- Audio spectrograms stored as .npy arrays.
- LiDAR and radar processed into image-like tensors or .npy arrays.
- Unknown sensors also converted to a known tensor format.

Now you have a consistent set of input tensors per sample.
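The bird's-eye-view rasterization mentioned for LiDAR could be sketched as follows. This is a minimal example under stated assumptions: the function name, the 40 m x 40 m region, and the 0.2 m cell size are hypothetical (chosen so the grid comes out at 200 x 200), and each cell simply keeps the maximum point height.

```python
import numpy as np


def lidar_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.2):
    """Rasterize an (N, 3) array of x, y, z points into a max-height BEV grid.

    Cells with no points, or whose points all lie below z = 0, stay at 0.
    """
    width = int(round((x_range[1] - x_range[0]) / cell))
    height = int(round((y_range[1] - y_range[0]) / cell))
    bev = np.zeros((height, width), dtype=np.float32)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Drop points that fall outside the region of interest.
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])

    cols = ((x[keep] - x_range[0]) / cell).astype(np.int64)
    rows = ((y[keep] - y_range[0]) / cell).astype(np.int64)
    # Guard against floating-point edge cases at the upper boundary.
    cols = np.clip(cols, 0, width - 1)
    rows = np.clip(rows, 0, height - 1)

    # Unbuffered maximum, so several points in one cell are all considered.
    np.maximum.at(bev, (rows, cols), z[keep].astype(np.float32))
    return bev
```

Other per-cell statistics (point count, mean intensity) can be stacked as extra channels in the same way; max height is only the simplest choice.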
A typical training input pipeline might look like this (in Python/PyTorch, as an example):

    import os

    import numpy as np
    import torch

    class MultiModalDataset(torch.utils.data.Dataset):
        def __init__(self, root_dir, split='train', transform=None):
            # Index all samples and their modalities
            self.samples = self._load_samples(root_dir, split)
            self.transform = transform

        def _load_samples(self, root_dir, split):
            # Traverse root_dir/split/classX/sample_Y/ and index all samples.
            # Returns a list of dicts with per-modality paths and an integer label.
            # Assumes offline preprocessing wrote the .npy files next to the video.
            split_dir = os.path.join(root_dir, split)
            samples = []
            for label, class_name in enumerate(sorted(os.listdir(split_dir))):
                class_dir = os.path.join(split_dir, class_name)
                for sample_name in sorted(os.listdir(class_dir)):
                    d = os.path.join(class_dir, sample_name)
                    samples.append({
                        'video_path': os.path.join(d, 'video.mp4'),
                        'audio_npy': os.path.join(d, 'audio.npy'),
                        'lidar_npy': os.path.join(d, 'lidar.npy'),
                        'radar_npy': os.path.join(d, 'radar.npy'),
                        'sensorX_npy': os.path.join(d, 'sensorX.npy'),
                        'label': label,
                    })
            return samples

        def _load_video(self, path):
            # Placeholder: decode frames (CPU, or GPU if integrated) into a tensor
            raise NotImplementedError

        def __getitem__(self, idx):
            sample = self.samples[idx]
            # Load each modality
            video_tensor = self._load_video(sample['video_path'])
            # CPU load, then convert to torch tensors
            audio_tensor = torch.from_numpy(np.load(sample['audio_npy']))
            lidar_tensor = torch.from_numpy(np.load(sample['lidar_npy']))
            radar_tensor = torch.from_numpy(np.load(sample['radar_npy']))
            sensorX_tensor = torch.from_numpy(np.load(sample['sensorX_npy']))
            modalities = (video_tensor, audio_tensor, lidar_tensor,
                          radar_tensor, sensorX_tensor)
            # If a transform is given, apply it here (normalize, augment)
            if self.transform:
                modalities = self.transform(modalities)
            return modalities, sample['label']

        def __len__(self):
            return len(self.samples)

Memory Considerations:
- If the video is stored in a GPU-friendly compressed format, you might integrate custom code that uses the FFmpeg libraries to decode frames directly to GPU memory, returning a GPU tensor. This avoids CPU-GPU copies. If that's too complicated, just decode on the CPU and call tensor.to(device) once per batch.
- For large modalities, consider partial loading or streaming (e.g., only load the LiDAR segment you need per batch). Tiling large inputs into patches and processing them asynchronously can help.

4. Training Loop with Optimized Memory Usage

During training:
- Use a DataLoader with num_workers > 0 to parallelize data loading on the CPU side.
- Use pinned (page-locked) memory for the DataLoader if available (pin_memory=True in PyTorch) to speed up CPU-to-GPU transfers.
- Preallocate GPU tensors if your shapes are fixed, to reduce re-allocation costs each iteration.
- If using large frames or high resolution, consider downscaling or partial processing as part of the transform pipeline.

Example Training Loop (Pseudocode):

    # some_transform and MultiModalModel are placeholders for your own code
    dataset = MultiModalDataset(root_dir='dataset', split='train', transform=some_transform)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True,
                                             num_workers=4, pin_memory=True)

    model = MultiModalModel()  # a model that takes video + audio + lidar + radar + sensorX
    model.to('cuda')
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()
    num_epochs = 10  # example value

    for epoch in range(num_epochs):
        model.train()
        for (video, audio, lidar, radar, sensorX), label in dataloader:
            # Move to GPU
            video = video.to('cuda', non_blocking=True)
            audio = audio.to('cuda', non_blocking=True)
            lidar = lidar.to('cuda', non_blocking=True)
            radar = radar.to('cuda', non_blocking=True)
            sensorX = sensorX.to('cuda', non_blocking=True)
            label = label.to('cuda', non_blocking=True)

            optimizer.zero_grad()
            output = model(video, audio, lidar, radar, sensorX)
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()

5. Integrating GPU-Accelerated Transcoding with Training

If you need on-the-fly transcoding or augmentation at training time (e.g., random resizing or cropping video on the GPU), you can use FFmpeg's GPU pipeline together with a named pipe or shared memory. Your __getitem__ might call a function that runs an FFmpeg command line (or uses the libav libraries) to decode a portion of the video directly into GPU memory. This is more complex, but can be done by writing custom C code or a Python extension that interfaces with the FFmpeg and CUDA APIs directly.
The key is to keep data on the GPU as long as possible, applying filters and scaling before handing the frames to PyTorch.

6. Handling Unknown Problem Gradients

You mentioned being memory-inefficient in order to optimize against an unknown problem gradient. This might mean experimenting with different input sizes, modalities, or GPU tiling strategies:
- Start with a straightforward approach (decode on CPU, transfer to GPU) and measure performance.
- If performance is insufficient, implement tiling: break large inputs into smaller chunks, process them asynchronously, and measure again.
- Adjust batch sizes, resolution, or tile sizes dynamically.
Because the problem gradient is unknown, implement flexible code that can quickly switch between these strategies.

7. Summary

- A standardized folder structure: split/class/sample/ with multiple modalities per sample.
- Preprocess all modalities offline into formats that are easy to load during training (e.g., .npy for non-video, GPU-compatible videos).
- Use FFmpeg with GPU acceleration for video transformations before or during training.
- Implement a training dataset and DataLoader that can handle multiple modalities efficiently.
- Use pinned memory, asynchronous transfers, tiling, and GPU-friendly formats to reduce CPU-GPU bottlenecks.
- Continuously profile and adjust the pipeline to handle unknown performance issues.

This integrated approach moves from raw multimodal data on disk, through GPU-accelerated preprocessing with FFmpeg, into a training loop that minimizes memory transfers and can adapt to complex, unknown bottlenecks by adjusting strategies like tiling and streaming.

Below is a more concrete, integrated scenario combining training, multiple modalities (video and others), and control via a running producer pipeline that provides data through shared memory or pipes. The goal is a setup where you can train a model directly on streaming data from a producer process, while also having static datasets on disk. This lets you adaptively control which frames, modalities, or segments you feed into the training loop in real time.

Key Points:

1. Producer-Consumer Setup Using Shared Memory or Named Pipes:

We previously described using named pipes or shared memory (via shm_open, mmap) to pass decoded frames or preprocessed data from a producer to a consumer. Now we integrate that into the training loop:
- The producer (an FFmpeg-based pipeline, plus possibly custom code) runs continuously, decoding live video and converting LiDAR, radar, and unknown sensor data into a standardized tensor form. This producer writes data into a shared memory region or pipe.
- The training process (consumer) reads from this shared memory or pipe to get fresh training samples.
This gives you real-time control: you can send commands to the producer to change filters, modalities, or subsets of the data on the fly, and the training loop will adapt to whatever data comes through.

2. Hybrid Approach: Disk + Live Feed:

Your dataset may have a standard directory structure for historical data, as outlined before. You can load from disk for the bulk of your training samples. Additionally, insert a special "live" modality or sample entry that reads from the producer in memory. This gives you a hybrid scenario:
- Most samples: static data from disk (preprocessed .npy, .mp4, etc.)
- Some samples: live data from the producer pipeline (video frames, sensor arrays) read directly from shared memory.

3. Shared Memory Data Flow:

The producer uses FFmpeg with GPU acceleration to decode and process frames. After processing (e.g., scaling video, converting LiDAR to an image), it writes the final tensors to a shared memory region. This shared memory can contain a header that indicates the shape, modality types, and a frame counter. Another region might store raw pixel or floating-point data. Semaphores or atomic flags signal when a new frame is ready.
The training process waits on a semaphore from the producer indicating a new sample is ready, then reads the data, converts it to a tensor, and feeds it into the training loop.

4. Code Sketch (Conceptual, Not Full Production Code):

Producer Side (C/C++):

    // Pseudocode: producer writes a single multimodal sample
    // (video frame + sensor arrays) to shared memory.
    // This can be integrated with FFmpeg's decoding pipeline as shown before.
    struct sample_header {
        int frame_number;
        int video_width;
        int video_height;
        int video_channels;             // e.g. 3 for RGB
        int lidar_width, lidar_height;  // if representing LiDAR as an image
        int radar_size;                 // arbitrary
        int sensorX_size;               // arbitrary
        // possibly more fields...
    };

    // Assume we have mapped the shared memory region and semaphores as previously described.
    // After decoding and preparing a frame, and the other modalities:
    struct sample_header *hdr = (struct sample_header *)shared_mem_base;
    unsigned char *data_ptr = (unsigned char *)(hdr + 1);  // payload starts right after the header

    // Fill hdr with metadata
    hdr->frame_number = current_frame_number;
    hdr->video_width = 640;
    hdr->video_height = 480;
    hdr->video_channels = 3;
    hdr->lidar_width = 200;
    hdr->lidar_height = 200;
    hdr->radar_size = 1024;
    hdr->sensorX_size = 512;

    // Copy video frame data (e.g. 640*480*3 bytes) into data_ptr
    memcpy(data_ptr, video_frame_data, 640*480*3);
    data_ptr += 640*480*3;

    // Copy LiDAR data
    memcpy(data_ptr, lidar_image_data, 200*200);
    data_ptr += 200*200;

    // Copy radar data
    memcpy(data_ptr, radar_data, 1024);
    data_ptr += 1024;

    // Copy sensorX data
    memcpy(data_ptr, sensorX_data, 512);

    // Signal to the consumer that a new sample is ready:
    sem_post(producer_sem);

Consumer (Training) Side (Python with PyTorch):

    import mmap
    import os
    import struct
    from collections import namedtuple

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    # Mirrors struct sample_header on the producer side: eight C ints.
    HEADER_FORMAT = '8i'
    sizeof_header = struct.calcsize(HEADER_FORMAT)
    SampleHeader = namedtuple('SampleHeader', [
        'frame_number', 'video_width', 'video_height', 'video_channels',
        'lidar_width', 'lidar_height', 'radar_size', 'sensorX_size'])

    class LiveMultimodalDataset(Dataset):
        def __init__(self, disk_root, live_shared_mem_path, use_live_feed=True):
            self.disk_samples = self._index_disk(disk_root)
            self.use_live_feed = use_live_feed
            # Map shared memory
            self.mem_fd = os.open(live_shared_mem_path, os.O_RDWR)
            # Suppose we know total_size from configuration
            total_size = 640*480*3 + 200*200 + 1024 + 512 + 1024  # just an example
            self.mmap_obj = mmap.mmap(self.mem_fd, total_size + sizeof_header,
                                      flags=mmap.MAP_SHARED,
                                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
            # Semaphores or signals are handled externally; see wait_for_sample_ready()

        def _index_disk(self, root):
            # Scan the folder structure and return a list of static samples
            samples = []
            # ...
            return samples

        def __len__(self):
            return len(self.disk_samples) + (1 if self.use_live_feed else 0)

        def __getitem__(self, idx):
            if self.use_live_feed and idx == len(self.disk_samples):
                # Read from the live feed
                self.wait_for_sample_ready()  # wait on a semaphore or event from the producer
                hdr = self._read_header()
                data = self._read_data(hdr)
                # Convert data to tensors
                video_tensor = torch.from_numpy(data['video']).float()
                lidar_tensor = torch.from_numpy(data['lidar']).float()
                radar_tensor = torch.from_numpy(data['radar']).float()
                sensorX_tensor = torch.from_numpy(data['sensorX']).float()
                # Example label: might come from an external source or a default label
                label = 0
                return (video_tensor, lidar_tensor, radar_tensor, sensorX_tensor), label
            else:
                # Load from disk
                sample = self.disk_samples[idx]
                # Load static .npy or .mp4 data similarly and return the same tuple layout
                raise NotImplementedError('disk loading is left out of this sketch')

        def _read_header(self):
            # Parse the binary sample_header at the start of the region
            self.mmap_obj.seek(0)
            fields = struct.unpack(HEADER_FORMAT, self.mmap_obj.read(sizeof_header))
            return SampleHeader(*fields)

        def _read_data(self, hdr):
            # Read data arrays from shared memory according to the header sizes
            self.mmap_obj.seek(sizeof_header)
            video_size = hdr.video_width * hdr.video_height * hdr.video_channels
            video_data = np.frombuffer(self.mmap_obj.read(video_size), dtype=np.uint8).reshape(
                hdr.video_height, hdr.video_width, hdr.video_channels)
            lidar_size = hdr.lidar_width * hdr.lidar_height
            lidar_data = np.frombuffer(self.mmap_obj.read(lidar_size), dtype=np.uint8).reshape(
                hdr.lidar_height, hdr.lidar_width)
            radar_data = np.frombuffer(self.mmap_obj.read(hdr.radar_size), dtype=np.uint8)
            sensorX_data = np.frombuffer(self.mmap_obj.read(hdr.sensorX_size), dtype=np.uint8)
            return {'video': video_data, 'lidar': lidar_data,
                    'radar': radar_data, 'sensorX': sensorX_data}

        def wait_for_sample_ready(self):
            # Block until producer_sem signals a new sample
            pass

    # Now, training code:
    dataset = LiveMultimodalDataset(disk_root='dataset',
                                    live_shared_mem_path='/dev/shm/myshared',
                                    use_live_feed=True)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

    model = MultiModalModel()  # hypothetical model
    model.cuda()
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        for (video, lidar, radar, sensorX), label in dataloader:
            video = video.cuda(non_blocking=True)
            lidar = lidar.cuda(non_blocking=True)
            radar = radar.cuda(non_blocking=True)
            sensorX = sensorX.cuda(non_blocking=True)
            label = label.cuda(non_blocking=True)

            optimizer.zero_grad()
            output = model(video, lidar, radar, sensorX)
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()

5. Dynamic Control Over the Producer:

The producer can listen to commands (through another pipe or shared memory) and change what it's writing.
For example:
- Command: "Switch video to grayscale"
- Command: "Use LiDAR from a different sensor"
- Command: "Change radar processing method"

The producer applies these changes, and the training loop automatically sees the different data in subsequent samples. This gives you end-to-end control:
- Start the producer with FFmpeg + custom code to decode and process all modalities in GPU memory, then write to shared memory.
- The producer can be commanded at runtime to alter filters or select different time segments of video.
- The training process continuously reads from both disk (for stable reference data) and the live producer feed (for dynamic, real-time data) and trains the model.

6. Managing GPU vs CPU Memory Transfers:

If you decode and preprocess on the GPU, you may still need to hwdownload to the CPU for shared memory writing, since shared memory is accessible by the CPU. If efficiency is paramount, consider producing and consuming entirely on the GPU: CUDA inter-process communication (IPC) can share GPU memory buffers between the producer and consumer processes. This is advanced, platform-specific, and not directly supported by the FFmpeg CLI, so you might implement custom code linking the libav* libraries with CUDA IPC.

For simplicity, the above code sticks to CPU shared memory. You can tile data or compress it before writing to reduce overhead. If frames are huge, tile them into chunks and process incrementally.

Conclusion:

This refined approach incorporates training directly on live pipeline data along with static datasets, gives you runtime control over the input via producer commands, and integrates multiple modalities. The final design involves:
- A producer process that decodes, processes, and places data into shared memory.
- A consumer (training) process that reads from both static disk-based datasets and the live shared memory feed.
- Control channels to send commands to the producer, altering the data that appears in the training loop.
- The ability to adapt strategies for memory handling, such as partial tiling or CUDA IPC, if needed.
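The tiling strategy that the thread keeps returning to can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: iter_tiles and reassemble are hypothetical helper names, the tiles are views into the original array (no copies), and edge tiles are allowed to be smaller than the requested size.

```python
import numpy as np


def iter_tiles(frame, tile_h, tile_w):
    """Yield (top, left, tile) for an (H, W, ...) array, row by row.

    Each tile is a view into `frame`; tiles on the bottom and right
    edges may be smaller than (tile_h, tile_w).
    """
    height, width = frame.shape[:2]
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            yield top, left, frame[top:top + tile_h, left:left + tile_w]


def reassemble(tiles, shape, dtype):
    """Rebuild the full array from (top, left, tile) triples."""
    out = np.empty(shape, dtype=dtype)
    for top, left, tile in tiles:
        out[top:top + tile.shape[0], left:left + tile.shape[1]] = tile
    return out
```

Because each tile carries its own offset, tiles can be processed in any order or in separate workers and still be written back to the right place, which is what makes the incremental, chunk-at-a-time processing described above practical.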

20 Nov 2024
Pixtral Large, the newly released multi-modal open model, is gaining attention for its SOTA performance, comparable to GPT-4o. A quick trial shows the model capable of Cantonese conversation and Chinese text OCR/understanding in images. It looks promising. #MultiModalModel
13 Nov 2024
Voyage AI Introduces voyage-multimodal-3: A New State-of-the-Art for Multimodal Embedding Model that Improves Retrieval Accuracy by an Average of 19.63% itinai.com/voyage-ai-introdu… #VoyageAI #MultimodalModel #DocumentRetrieval #DeepLearning #AIInnovation #ai #news #llm #ml #resea…
Replying to @CGTNOfficial
A Chinese developer has launched a multimodal model that integrates video, image, and text, marking a significant advancement in artificial intelligence. This model aims to enhance content creation and analysis by enabling seamless interaction between different media types, improving tasks like video summarization, image captioning, and text analysis. The development reflects the growing trend in AI towards creating more comprehensive and versatile tools that can understand and process various forms of data. #AI #MultimodalModel #Video #Image #Text #ArtificialIntelligence
The Parable of the Good Samaritan 1/3 The parable of the Good Samaritan teaches us to love thy neighbor as thyself.