Filter
Exclude
Time range
-
Near
【Gemini Embedding 2】マルチモーダルベクトル生成スクリプトを作成しました。 Gemini Embedding 2を試してみました。 具体的には、サンプルの「テキスト、画像、音声」をインターリーブ入力として与え、1つのエンベディングベクトルを生成するスクリプトを作成・実行しました。 デフォルトの3072次元のベクトルを生成し、ベクトルの先頭5要素などを標準出力するスクリプトとなります。 # 実行結果 以下のように、それらしいベクトル情報が出力されました。 ```zsh ・生成されたベクトルの次元数: 3072 次元 (デフォルト) ・ベクトルの先頭5要素: [-0.0046049873, -0.01866987, 0.012537489, 0.021659078, 0.0051614824] ``` # 今後の展望 1. 本モデルを利用した検索 テスト用のマルチモーダルなベクトルデータベースを構築し、意味検索を行うことを試します。 例えば、キャラクター画像とその説明テキストを組み合わせたマルチモーダルベクトルでデータベースを構築します。 その後、「このBGMの雰囲気に合う画像を探したい」といったテキストと音声データの複合クエリを用いた意味検索を行います。 # 動画タイムライン 0:00 スクリプトの実行画面 0:32 スクリプトの中身 # 関連URL [Gemini Embedding 2 | Google](blog.google/innovation-and-a…) [Gemini Embedding 2 の概要 | npaka](note.com/npaka/n/n1e39b1e675…) # タグ #GeminiEmbedding2 #EmbeddingModel #LLM #Multimodal #GameDev
17
424
LM StudioをScreaming Frogで使うためのCustom JSコード貼っときます。上部のMODEL_IDENTIFIERは使うモデル名に変更。コードはPassage EmbeddingsをLM Studio用に書き換えてます。 // ================================================================================= // Screaming Frog Recipe: Semantic Chunking Local Embeddings via LM Studio // // This script is modified to use a local embedding model (like EmbeddingGemma) // running on LM Studio, which provides an OpenAI-compatible API. // // SETUP REQUIRED: // 1. Install LM Studio on your Mac: lmstudio.ai/ // 2. Download an embedding model (e.g., "EmbeddingGemma") from the search tab. // 3. Go to the "Local Server" tab (server icon on the left). // 4. Select your downloaded embedding model. // 5. Start the server. Keep LM Studio running while you crawl. // // SCRIPT CONFIGURATION: // - Verify the MODEL_IDENTIFIER matches the one shown in LM Studio. // - Verify the LM_STUDIO_URL is correct (default is http://localhost:1234). // ================================================================================= // --- LM Studio Configuration --- // This should match the model identifier shown in your LM Studio server tab. const MODEL_IDENTIFIER = 'google/embedding-gemma-v1.5'; // Default URL for LM Studio's local server. Change the port if you have configured it differently. const LM_STUDIO_URL = 'http://localhost:1234/v1/embeddings'; /** * Get embeddings for an array of texts using the local LM Studio server. * @param {string[]} texts - An array of text strings to embed. * @returns {Promise<number[][]>} A promise that resolves to an array of embedding vectors. */ async function getEmbeddings(texts) { try { const response = await fetch(LM_STUDIO_URL, { method: 'POST', headers: { 'Content-Type': 'application/json', }, // OpenAI-compatible API format body: JSON.stringify({ input: texts, model: MODEL_IDENTIFIER, }), }); if (!response.ok) { const errorText = await response.text(); throw new Error(`LM Studio API error (${response.status}): ${errorText}`); } const data = await response.json(); // The response is a list of objects, each containing an "embedding" array. // We need to extract just the array for each item. return data.data.map(item => item.embedding); } catch (error) { // This often happens if the LM Studio server is not running. console.error('Failed to connect to LM Studio API.', error); throw new Error(`Could not get embeddings from LM Studio: ${error.message}`); } } // --- Original Script Logic (Mostly Unchanged) --- // Configuration for chunking const CONFIG = { minChunkLength: 50, // Minimum characters per chunk maxChunkLength: 500, // Maximum characters per chunk includeMetadata: true, // Include element type and position info batchSize: 5, // Number of chunks to process in parallel retryAttempts: 3, // Retry failed requests retryDelay: 1000, // Delay between retries (ms) }; /** * Extract semantic passages from the webpage DOM * @returns {Array} Array of chunk objects with text and metadata */ function extractSemanticChunks() { const chunks = []; let chunkIndex = 0; const semanticSelectors = [ 'h1, h2, h3, h4, h5, h6', 'p', 'li', 'blockquote', 'article', 'section', 'div[role="main"]', 'td, th', 'figcaption', 'summary', 'dd', ]; semanticSelectors.forEach((selector) => { document.querySelectorAll(selector).forEach((element, index) => { const text = element.textContent?.trim(); if (text && text.length >= CONFIG.minChunkLength) { const textChunks = splitLongText(text, CONFIG.maxChunkLength); textChunks.forEach((chunkText, subIndex) => { chunks.push({ text: chunkText, index: chunkIndex , metadata: CONFIG.includeMetadata ? { elementType: element.tagName.toLowerCase(), elementIndex: index, subChunkIndex: subIndex, totalSubChunks: textChunks.length, xpath: getXPath(element), textLength: chunkText.length, } : null, }); }); } }); }); if (chunks.length === 0) { const bodyText = document.body.textContent?.trim(); if (bodyText) { const textChunks = splitLongText(bodyText, CONFIG.maxChunkLength); textChunks.forEach((chunkText, index) => { chunks.push({ text: chunkText, index: index, metadata: CONFIG.includeMetadata ? { elementType: 'body', elementIndex: 0, subChunkIndex: index, totalSubChunks: textChunks.length, xpath: '/html/body', textLength: chunkText.length, } : null, }); }); } } return chunks; } /** * Split long text into smaller chunks while preserving sentence boundaries * @param {string} text - Text to split * @param {number} maxLength - Maximum length per chunk * @returns {Array} Array of text chunks */ function splitLongText(text, maxLength) { if (text.length <= maxLength) return [text]; const chunks = []; const sentences = text.split(/(?<=[.!?])\s /); let currentChunk = ''; for (const sentence of sentences) { if ((currentChunk sentence).length <= maxLength) { currentChunk = (currentChunk ? ' ' : '') sentence; } else { if (currentChunk) chunks.push(currentChunk); if (sentence.length > maxLength) { const words = sentence.split(' '); let wordChunk = ''; for (const word of words) { if ((wordChunk ' ' word).length > maxLength) { chunks.push(wordChunk); wordChunk = word; } else { wordChunk = (wordChunk ? ' ' : '') word; } } currentChunk = wordChunk; } else { currentChunk = sentence; } } } if (currentChunk) chunks.push(currentChunk); return chunks.filter(chunk => chunk.length >= CONFIG.minChunkLength); } /** * Get XPath for an element * @param {Element} element - DOM element * @returns {string} XPath string */ function getXPath(element) { if (element.id) return `//*[@id="${element.id}"]`; const parts = []; while (element && element.nodeType === Node.ELEMENT_NODE) { let index = 0; let sibling = element.previousSibling; while (sibling) { if (sibling.nodeType === Node.ELEMENT_NODE && sibling.tagName === element.tagName) { index ; } sibling = sibling.previousSibling; } const tagName = element.tagName.toLowerCase(); const pathIndex = index > 0 ? `[${index 1}]` : ''; parts.unshift(`${tagName}${pathIndex}`); element = element.parentNode; } return parts.length ? `/${parts.join('/')}` : ''; } /** * Process chunks in batches with retry logic * @param {Array} chunks - Array of chunk objects * @returns {Promise} Promise resolving to embedded chunks */ async function processChunksWithRetry(chunks) { const results = []; for (let i = 0; i < chunks.length; i = CONFIG.batchSize) { const batch = chunks.slice(i, i CONFIG.batchSize); const texts = batch.map(chunk => chunk.text); let attempt = 0; let success = false; while (attempt < CONFIG.retryAttempts && !success) { try { const embeddings = await getEmbeddings(texts); batch.forEach((chunk, index) => { results.push({ ...chunk, embedding: embeddings[index], embeddingModel: MODEL_IDENTIFIER, processingTimestamp: new Date().toISOString(), }); }); success = true; } catch (error) { attempt ; console.warn(`Batch ${Math.floor(i / CONFIG.batchSize) 1} attempt ${attempt} failed:`, error.message); if (attempt >= CONFIG.retryAttempts) { batch.forEach(chunk => { results.push({ ...chunk, embedding: null, error: error.message, embeddingModel: MODEL_IDENTIFIER, processingTimestamp: new Date().toISOString(), }); }); } else { await new Promise(resolve => setTimeout(resolve, CONFIG.retryDelay * attempt)); } } } } return results; } /** * Main processing function * @returns {Promise} Promise resolving to processing results */ async function processPageEmbeddings() { try { console.log('Extracting semantic chunks from webpage...'); const chunks = extractSemanticChunks(); console.log(`Extracted ${chunks.length} semantic chunks`); if (chunks.length === 0) { throw new Error('No content chunks found on the page'); } console.log(`Processing embeddings using local model ${MODEL_IDENTIFIER}...`); const embeddedChunks = await processChunksWithRetry(chunks); const successfulEmbeddings = embeddedChunks.filter(chunk => chunk.embedding !== null).length; const failedEmbeddings = embeddedChunks.length - successfulEmbeddings; const result = { success: true, model: MODEL_IDENTIFIER, totalChunks: embeddedChunks.length, successfulEmbeddings, failedEmbeddings, processingTimestamp: new Date().toISOString(), pageUrl: window.location.href, pageTitle: document.title, chunks: embeddedChunks, summary: { avgChunkLength: Math.round(embeddedChunks.reduce((sum, chunk) => sum chunk.text.length, 0) / (embeddedChunks.length || 1)), elementTypes: [...new Set(embeddedChunks.map(chunk => chunk.metadata?.elementType).filter(Boolean))], embeddingDimensions: embeddedChunks.find(chunk => chunk.embedding)?.embedding?.length || null, }, }; console.log(`Processing complete: ${successfulEmbeddings}/${embeddedChunks.length} chunks embedded successfully`); return result; } catch (error) { console.error('Processing failed:', error); return { success: false, error: error.message, model: MODEL_IDENTIFIER, processingTimestamp: new Date().toISOString(), pageUrl: window.location.href, pageTitle: document.title, }; } } // Execute the main function and return results to Screaming Frog return processPageEmbeddings() .then(result => seoSpider.data(JSON.stringify(result, null, 2))) .catch(error => seoSpider.error(`Script execution failed: ${error.message}`));
Googleから昨日リリースされたEmbeddingGemmaをLM Studioで動かしてScreaming Frogで使ってみる。コンパクトで高性能、Macbookで余裕で動く。これでAPIコストかけずにSEOでベクトル関連の分析ができる🎉色々試してみよ
1
1
6
734
❌ MYTH: It doesn't matter what embedding model you use. ✅ FACT: To get optimal and accurate search results, choose an embedding model that is training on similar data to create your embeddings. Pay attention to if it's designed for image, search or another type of unstructured data. Compare the models: zilliz.com/ai-models #embeddingmodel #vectordatabase #Milvus #unstructureddata
3
5
163
❌ MYTH: It doesn't matter what embedding model you use. ✅ FACT: To get optimal and accurate search results, choose an embedding model that is training on similar data to create your embeddings. Pay attention to if it's designed for image, search or another type of unstructured data. Compare the models: zilliz.com/ai-models #embeddingmodel #vectordatabase #Milvus #unstructureddata
5
9
334
Generated from Claud with no prior context except the seed: "Hi Claude I want to serve search results from a Github codebase using vector search over the chunks of the code base. Tell me some high level software architecture designs for this description:" And had a couple of phases of conversation. And it managed to produce this. Very based Claude, very based /vector-search-github/ ├── data_ingestion/ │ ├── github_api_client.py │ │ class GitHubAuth: │ │ - token: str │ │ __init__(token: str) │ │ get_headers() -> dict │ │ │ │ class GitHubRepo: │ │ - name: str │ │ - api_url: str │ │ __init__(name: str, api_url: str) │ │ get_file_list() -> list[str] │ │ │ │ class GitHubClient: │ │ - auth: GitHubAuth │ │ __init__(auth: GitHubAuth) │ │ fetch_repository(repo: GitHubRepo) -> dict │ │ get_file_content(repo: GitHubRepo, file_path: str) -> str │ │ get_commit_history(repo: GitHubRepo, file_path: str) -> list │ │ │ ├── code_chunker.py │ │ class ChunkConfig: │ │ - chunk_size: int │ │ - overlap: int │ │ __init__(chunk_size: int, overlap: int) │ │ │ │ class Chunk: │ │ - content: str │ │ - start_line: int │ │ - end_line: int │ │ __init__(content: str, start_line: int, end_line: int) │ │ │ │ class CodeChunker: │ │ - config: ChunkConfig │ │ __init__(config: ChunkConfig) │ │ chunk_code(code_string: str) -> list[Chunk] │ │ process_file(file_content: str) -> list[Chunk] │ │ │ ├── text_embedding.py │ │ class EmbeddingModel: │ │ - name: str │ │ - dimension: int │ │ __init__(name: str, dimension: int) │ │ load() -> None │ │ │ │ class TextEmbedder: │ │ - model: EmbeddingModel │ │ __init__(model: EmbeddingModel) │ │ embed_text(text: str) -> np.ndarray │ │ embed_chunks(chunks: list[Chunk]) -> list[np.ndarray] │ │ │ │ class SimilarityCalculator: │ │ compute_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float │ │ │ ├── code_analyzer.py │ │ class CodeStructure: │ │ - functions: list[str] │ │ - classes: list[str] │ │ __init__(functions: list[str], classes: list[str]) │ │ │ │ class ComplexityMetrics: │ │ - cyclomatic_complexity: int │ │ - cognitive_complexity: int │ │ __init__(cyclomatic_complexity: int, cognitive_complexity: int) │ │ │ │ class CodeAnalyzer: │ │ extract_structure(code: str) -> CodeStructure │ │ analyze_complexity(code: str) -> ComplexityMetrics │ │ detect_language(code: str) -> str ├── vector_database/ │ ├── vector_store.py │ │ class VectorEntry: │ │ - id: str │ │ - vector: np.ndarray │ │ - metadata: dict │ │ __init__(id: str, vector: np.ndarray, metadata: dict) │ │ │ │ class VectorStore: │ │ - database: DatabaseManager │ │ __init__(database: DatabaseManager) │ │ insert_vector(entry: VectorEntry) -> None │ │ search_similar(query_vector: np.ndarray, top_k: int) -> list[VectorEntry] │ │ delete_vector(vector_id: str) -> None │ │ update_vector(vector_id: str, new_entry: VectorEntry) -> None │ │ │ ├── indexing.py │ │ class IndexConfig: │ │ - algorithm: str │ │ - parameters: dict │ │ __init__(algorithm: str, parameters: dict) │ │ │ │ class VectorIndex: │ │ - config: IndexConfig │ │ - index: Any # Placeholder for the actual index structure │ │ __init__(config: IndexConfig) │ │ build_index(vectors: list[np.ndarray]) -> None │ │ query(query_vector: np.ndarray, num_results: int) -> list[int] │ │ add_to_index(vector: np.ndarray, id: int) -> None │ │ remove_from_index(id: int) -> None │ │ │ ├── database_manager.py │ │ class DatabaseConfig: │ │ - url: str │ │ - max_connections: int │ │ __init__(url: str, max_connections: int) │ │ │ │ class DatabaseManager: │ │ - config: DatabaseConfig │ │ - connection: Any # Placeholder for the actual database connection │ │ __init__(config: DatabaseConfig) │ │ create_tables() -> None │ │ execute_query(query: str, params: tuple) -> None │ │ fetch_all(query: str, params: tuple) -> list │ │ close_connection() -> None ├── search_api/ │ ├── query_handler.py │ │ class Query: │ │ - text: str │ │ - filters: dict │ │ __init__(text: str, filters: dict) │ │ │ │ class SearchResult: │ │ - content: str │ │ - similarity: float │ │ - metadata: dict │ │ __init__(content: str, similarity: float, metadata: dict) │ │ │ │ class QueryHandler: │ │ - vector_store: VectorStore │ │ - text_embedder: TextEmbedder │ │ __init__(vector_store: VectorStore, text_embedder: TextEmbedder) │ │ process_query(query: Query) -> list[SearchResult] │ │ format_results(search_results: list[SearchResult]) -> list[dict] │ │ │ ├── vector_converter.py │ │ class VectorConverter: │ │ - text_embedder: TextEmbedder │ │ __init__(text_embedder: TextEmbedder) │ │ text_to_vector(text: str) -> np.ndarray │ │ │ │ class VectorCompressor: │ │ - compression_method: str │ │ __init__(compression_method: str) │ │ compress_vector(vector: np.ndarray) -> np.ndarray │ │ decompress_vector(compressed_vector: np.ndarray) -> np.ndarray │ │ │ ├── similarity_search.py │ │ class SearchConfig: │ │ - top_k: int │ │ - similarity_threshold: float │ │ __init__(top_k: int, similarity_threshold: float) │ │ │ │ class SimilaritySearch: │ │ - vector_store: VectorStore │ │ - vector_index: VectorIndex │ │ - config: SearchConfig │ │ __init__(vector_store: VectorStore, vector_index: VectorIndex, config: SearchConfig) │ │ search(query_vector: np.ndarray) -> list[SearchResult] │ │ range_search(query_vector: np.ndarray, radius: float) -> list[SearchResult] │ │ │ ├── search_optimizer.py │ │ class QueryCache: │ │ - cache: dict │ │ - max_size: int │ │ __init__(max_size: int) │ │ add(query: str, results: list[SearchResult]) -> None │ │ get(query: str) -> list[SearchResult] │ │ │ │ class SearchOptimizer: │ │ - similarity_search: SimilaritySearch │ │ - query_cache: QueryCache │ │ __init__(similarity_search: SimilaritySearch, query_cache: QueryCache) │ │ optimize_query(query: str) -> str │ │ cache_frequent_queries(query_history: list[str]) -> None │ │ suggest_related_queries(query: str) -> list[str] ├── web_interface/ │ ├── app.py │ │ class FlaskApp: │ │ - app: Flask │ │ - query_handler: QueryHandler │ │ __init__(query_handler: QueryHandler) │ │ index() -> str │ │ search() -> str │ │ advanced_search() -> str │ │ api_search() -> dict ├── background_jobs/ │ ├── update_codebase.py │ │ class UpdateTask: │ │ - repo: GitHubRepo │ │ - last_update: datetime │ │ __init__(repo: GitHubRepo) │ │ │ │ class CodebaseUpdater: │ │ - github_client: GitHubClient │ │ - vector_store: VectorStore │ │ - tasks: list[UpdateTask] │ │ __init__(github_client: GitHubClient, vector_store: VectorStore) │ │ check_for_updates() -> list[str] │ │ process_updates(updates: list[str]) -> None │ │ │ ├── refresh_vector_db.py │ │ class RefreshTask: │ │ - chunk: Chunk │ │ - last_refresh: datetime │ │ __init__(chunk: Chunk) │ │ │ │ class VectorDBRefresher: │ │ - vector_store: VectorStore │ │ - code_chunker: CodeChunker │ │ - text_embedder: TextEmbedder │ │ - tasks: list[RefreshTask] │ │ __init__(vector_store: VectorStore, code_chunker: CodeChunker, text_embedder: TextEmbedder) │ │ refresh_database() -> None │ │ process_chunk(chunk: Chunk) -> None │ │ │ ├── job_scheduler.py │ │ class Job: │ │ - id: str │ │ - func: callable │ │ - trigger: str │ │ - kwargs: dict │ │ __init__(id: str, func: callable, trigger: str, **kwargs) │ │ │ │ class JobScheduler: │ │ - jobs: list[Job] │ │ __init__() │ │ add_job(job: Job) -> None │ │ remove_job(job_id: str) -> None │ │ start() -> None │ │ shutdown() -> None ├── config/ │ ├── settings.py │ ├── logging_config.py │ │ class LogConfig: │ │ - level: str │ │ - format: str │ │ - file: str │ │ __init__(level: str, format: str, file: str) │ │ │ │ setup_logging(config: LogConfig) -> None │ │ get_logger(name: str) -> Logger ├── utils/ │ ├── performance_monitor.py │ │ class Timer: │ │ - start_time: float │ │ - end_time: float │ │ start() -> None │ │ stop() -> float │ │ │ │ class PerformanceMonitor: │ │ - timers: dict[str, Timer] │ │ - memory_usage: list[float] │ │ start_timer(operation: str) -> None │ │ stop_timer(operation: str) -> float │ │ log_memory_usage() -> None │ │ generate_report() -> dict │ │ │ ├── error_handler.py │ │ class ErrorContext: │ │ - module: str │ │ - function: str │ │ - parameters: dict │ │ __init__(module: str, function: str, parameters: dict) │ │ │ │ class ErrorHandler: │ │ - logger: Logger │ │ __init__(logger: Logger) │ │ handle_error(error: Exception, context: ErrorContext) -> None │ │ log_error(error: Exception, context: ErrorContext) -> None │ │ notify_admin(error: Exception, context: ErrorContext) -> None ├── tests/ │ ├── test_github_api_client.py │ ├── test_code_chunker.py │ ├── test_text_embedding.py │ ├── test_vector_store.py │ ├── test_query_handler.py │ ├── test_similarity_search.py │ └── test_web_interface.py └── main.py initialize_components() -> tuple setup_background_jobs(job_scheduler: JobScheduler) -> None main() -> None

1
1
71
Had a blast creating this #MultiModal tutorial/demo! Used fun mix of tools for awesome results! 💡✨ 🖼️@milvusio for the #VectorDatabase 🧠 a tiny Clip #EmbeddingModel by @ashvardanian 🤖 @chatgpt4o as the #LLM. Check it out! ➡️ github.com/christy/ZillizDem… #AI #MachineLearning

12 Jul 2024
🤯 📣 Weekend Project! Test out BOTH #multimodal and #multilingual capabilities by following this tutorial using #Milvus and #GPT4o. 🎥 bit.ly/3RYW43m @cbergman
1
10
1,739
Introducing nomic-embed-text from @nomic_ai 💥 It is a large context length text encoder that surpasses @OpenAI’s text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks. Video: youtu.be/LpcaeQZDVB8 #embeddingmodel #nomicai #openai
2
3
292
Replying to @anil_ozturkk
Seri üzgün; EmbeddingModel supports Chinese (ch) and English (en) (more languages support will come soon), while RerankerModel supports Chinese (ch), English (en), Japanese (ja) and Korean (ko)
1
2
386
Good point regarding OpenAI's text-embedding-ada-002. There are open-source models that outperform OpenAI (see MTEB leaderboard), and they are free and available offline! Some of these popular models are available in LangChain4j out of the box, making embedding straightforward in just 2 lines of code: EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel(); Embedding embedding = model.embed("I love sentence transformers.").content(); Full example: github.com/langchain4j/langc…

You keep reading about sentence embeddings, but you might still not know exactly what they are. You are not alone! 🤗 I wrote a step-by-step walkthrough with code, math, applications, and memes. Kick off your year understanding what embeddings are! osanseviero.github.io/hacker…
1
3
9
1,689
Take a deep-dive into how our teams are moving from simple #FuzzyTextMatching & #GeoLocationFiltering to building an #EmbeddingModel that’s resistant to spelling variations. Our latest byte is on #SiameseNetworks and the exciting potential they hold! bit.ly/3ryMp3t

1
2