Cem Koç
@cemkoch.bsky.social
24 followers 34 following 9 posts
Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad
cemkoch.bsky.social
Huge thanks to the amazing people:
@pavankumarvasu.bsky.social, Fartash Faghri, Chun-Liang Li, Hadi Pouransari, @onceltuzel.bsky.social, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Christopher Webb
cemkoch.bsky.social
Today we released the code and a demo iOS application for FastVLM, our extremely efficient and fast vision-language model that runs on your device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...
Reposted by Cem Koç
kindsuss.bsky.social
If you're looking for research scientist roles in Europe, check out Marco's post! The Paris team is fantastic, and does diverse idea-driven and impactful research. In addition, MLR is highly collaborative across timezones, so you'd have a chance to work with many others too.
marcocuturi.bsky.social
The Apple Machine Learning Research (MLR) team in Paris has openings for both FTE roles and a short-term post-doc position to contribute to our team's research agenda. Researchers at Apple's MLR (led by Samy Bengio) target impactful publications in top-tier ML venues and OSS.
cemkoch.bsky.social
What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀
cemkoch.bsky.social
We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs)👇
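Purely as a toy illustration of that kind of sweep (every configuration and number below is a synthetic placeholder, not a result from the paper), a Pareto frontier over (latency, accuracy) points can be computed like this:

```python
# Toy sketch of a resolution / LLM-size sweep and a Pareto-frontier filter.
# All values are synthetic placeholders, not results from the FastVLM paper.
from itertools import product

resolutions = (256, 512, 768, 1024)   # input resolutions swept
llm_sizes_b = (0.5, 1.5, 7.0)         # LLM sizes in billions of parameters

def evaluate(resolution: int, llm_size_b: float) -> tuple[float, float]:
    """Stand-in for the real benchmark; returns (latency_ms, accuracy).
    Purely synthetic: latency grows with resolution and LLM size,
    accuracy saturates as both increase."""
    latency_ms = 0.2 * resolution + 40.0 * llm_size_b
    accuracy = 100.0 * (1.0 - 1.0 / (0.01 * resolution + llm_size_b + 1.0))
    return latency_ms, accuracy

points = [(*evaluate(res, llm), (res, llm))
          for res, llm in product(resolutions, llm_sizes_b)]

# A configuration is Pareto-optimal if no other configuration is both
# strictly faster and strictly more accurate.
pareto = [p for p in points
          if not any(q[0] < p[0] and q[1] > p[1] for q in points)]

for latency_ms, accuracy, config in sorted(pareto):
    print(f"{config}: {latency_ms:.0f} ms, {accuracy:.1f}% (synthetic)")
```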
cemkoch.bsky.social
Text-rich tasks require high image resolutions, which increase both vision encoding latency and the number of image tokens, which in turn drives up LLM pre-filling time. Therefore, instead of an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
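As a back-of-the-envelope illustration of why resolution is so costly for an isotropic encoder (the patch size of 14 is an assumed example value, not FastViTHD's configuration), the number of image tokens grows quadratically with resolution:

```python
# Illustrative only: image-token count for an isotropic ViT-style encoder.
# Patch size 14 is an assumed example value; the point is that token count
# grows quadratically with resolution, inflating LLM pre-filling time.
PATCH_SIZE = 14

for resolution in (256, 512, 768, 1024):
    num_tokens = (resolution // PATCH_SIZE) ** 2
    print(f"{resolution}x{resolution} -> {num_tokens} image tokens")
```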
cemkoch.bsky.social
We measure time-to-first-token (TTFT), the wait time until the VLM returns its first token. It combines vision encoder latency and LLM pre-filling time (the time the LLM takes to fill the KV cache and output its first token); at high resolutions, vision encoder latency dominates.
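A minimal sketch of that decomposition, with made-up latency numbers purely for illustration:

```python
# Minimal sketch of the TTFT decomposition described above:
#   TTFT = vision encoder latency + LLM pre-filling time.
# All numbers are arbitrary placeholders, not measurements from the paper.

def ttft_ms(vision_encoder_ms: float, num_image_tokens: int,
            prefill_ms_per_token: float) -> float:
    """Time-to-first-token = vision encoding + KV-cache pre-fill."""
    return vision_encoder_ms + num_image_tokens * prefill_ms_per_token

# Hypothetical low- and high-resolution settings; at high resolution the
# vision encoder term dominates the total.
print(ttft_ms(vision_encoder_ms=30.0, num_image_tokens=324, prefill_ms_per_token=0.05))
print(ttft_ms(vision_encoder_ms=250.0, num_image_tokens=2916, prefill_ms_per_token=0.05))
```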
cemkoch.bsky.social
FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce encoding time for high-resolution images.
cemkoch.bsky.social
Excited about vision-language models? 🚀 Check out our latest work on FastVLM, a new family of efficient vision-language models that balances the tradeoff between high-resolution image understanding and latency without compromising accuracy!

arxiv.org/abs/2412.13303
Reposted by Cem Koç
jgu32.bsky.social
🤔 Image-to-3D, monocular depth estimation, camera pose estimation, …, can we achieve all of this easily with just ONE model?

🚀Our answer is Yes -- Excited to introduce our latest work: World-consistent Video Diffusion (WVD) with Explicit 3D Modeling!

arxiv.org/abs/2412.01821
WVD Pipeline
Reposted by Cem Koç
alaaelnouby.bsky.social
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...
Reposted by Cem Koç
charlottemagister.bsky.social
Looking for an alternative to RAG for personalization?

With PLUM, a pipeline for teaching LLMs to remember prior user conversations, we aim to enable your future personalization research! Joint work with @maartjeterhoeve.bsky.social, Katherine Metcalf and Yizhe Zhang from my internship at Apple.

🧵