Cem Koç
@cemkoch.bsky.social
24 followers 34 following 9 posts
Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad
cemkoch.bsky.social
Huge thanks to the amazing people:
@pavankumarvasu.bsky.social, Fartash Faghri, Chun-Liang Li, Hadi Pouransari, @onceltuzel.bsky.social, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Christopher Webb
cemkoch.bsky.social
Today we released the code and a demo iOS application for FastVLM, our extremely efficient and fast vision-language model that runs on your device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...
Reposted by Cem Koç
kindsuss.bsky.social
If you're looking for research scientist roles in Europe, check out Marco's post! The Paris team is fantastic, and does diverse idea-driven and impactful research. In addition, MLR is highly collaborative across timezones, so you'd have a chance to work with many others too.
marcocuturi.bsky.social
The Apple Machine Learning Research (MLR) team in Paris has openings for both FTE roles and a short-term post-doc position to contribute to our team's research agenda. Researchers at Apple's MLR (led by Samy Bengio) target impactful publications in top-tier ML venues and OSS.
cemkoch.bsky.social
What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀
cemkoch.bsky.social
We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs)👇
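Purely as a toy illustration of that kind of sweep (every configuration and number below is a synthetic placeholder, not a result from the paper), a Pareto frontier over (latency, accuracy) points can be computed like this:

```python
# Toy sketch of a resolution / LLM-size sweep and a Pareto-frontier filter.
# All values are synthetic placeholders, not results from the FastVLM paper.
from itertools import product

resolutions = (256, 512, 768, 1024)   # input resolutions swept
llm_sizes_b = (0.5, 1.5, 7.0)         # LLM sizes in billions of parameters

def evaluate(resolution: int, llm_size_b: float) -> tuple[float, float]:
    """Stand-in for the real benchmark; returns (latency_ms, accuracy).
    Purely synthetic: latency grows with resolution and LLM size,
    accuracy saturates as both increase."""
    latency_ms = 0.2 * resolution + 40.0 * llm_size_b
    accuracy = 100.0 * (1.0 - 1.0 / (0.01 * resolution + llm_size_b + 1.0))
    return latency_ms, accuracy

points = [(*evaluate(res, llm), (res, llm))
          for res, llm in product(resolutions, llm_sizes_b)]

# A configuration is Pareto-optimal if no other configuration is both
# strictly faster and strictly more accurate.
pareto = [p for p in points
          if not any(q[0] < p[0] and q[1] > p[1] for q in points)]

for latency_ms, accuracy, config in sorted(pareto):
    print(f"{config}: {latency_ms:.0f} ms, {accuracy:.1f}% (synthetic)")
```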
cemkoch.bsky.social
Text-rich tasks require high image resolutions, which increase both vision encoding latency and the number of image tokens, which in turn drives up LLM pre-filling time. Therefore, instead of an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
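As a back-of-the-envelope illustration of why resolution is so costly for an isotropic encoder (the patch size of 14 is an assumed example value, not FastViTHD's configuration), the number of image tokens grows quadratically with resolution:

```python
# Illustrative only: image-token count for an isotropic ViT-style encoder.
# Patch size 14 is an assumed example value; the point is that token count
# grows quadratically with resolution, inflating LLM pre-filling time.
PATCH_SIZE = 14

for resolution in (256, 512, 768, 1024):
    num_tokens = (resolution // PATCH_SIZE) ** 2
    print(f"{resolution}x{resolution} -> {num_tokens} image tokens")
```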
cemkoch.bsky.social
We measure time-to-first-token (TTFT), the wait time until the VLM returns its first token. It combines vision encoder latency and LLM pre-filling time (the time the LLM takes to fill the KV cache and output its first token); at high resolutions, vision encoder latency dominates.
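A minimal sketch of that decomposition, with made-up latency numbers purely for illustration:

```python
# Minimal sketch of the TTFT decomposition described above:
#   TTFT = vision encoder latency + LLM pre-filling time.
# All numbers are arbitrary placeholders, not measurements from the paper.

def ttft_ms(vision_encoder_ms: float, num_image_tokens: int,
            prefill_ms_per_token: float) -> float:
    """Time-to-first-token = vision encoding + KV-cache pre-fill."""
    return vision_encoder_ms + num_image_tokens * prefill_ms_per_token

# Hypothetical low- and high-resolution settings; at high resolution the
# vision encoder term dominates the total.
print(ttft_ms(vision_encoder_ms=30.0, num_image_tokens=324, prefill_ms_per_token=0.05))
print(ttft_ms(vision_encoder_ms=250.0, num_image_tokens=2916, prefill_ms_per_token=0.05))
```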
cemkoch.bsky.social
FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce encoding time for high-resolution images.
cemkoch.bsky.social
Excited about vision-language models? 🚀 Check out our latest work on FastVLM, a new family of efficient vision-language models that balances the tradeoff between high-resolution image understanding and latency without compromising accuracy!

arxiv.org/abs/2412.13303
Reposted by Cem Koç
jgu32.bsky.social
🤔 Image-to-3D, monocular depth estimation, camera pose estimation, …, can we achieve all of this easily with just ONE model?

🚀Our answer is Yes -- Excited to introduce our latest work: World-consistent Video Diffusion (WVD) with Explicit 3D Modeling!

arxiv.org/abs/2412.01821
WVD Pipeline
Reposted by Cem Koç
alaaelnouby.bsky.social
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...
Reposted by Cem Koç
charlottemagister.bsky.social
Looking for an alternative to RAG for personalization?

With PLUM, a pipeline for teaching LLMs to remember prior user conversations, we aim to enable your future personalization research! Joint work with @maartjeterhoeve.bsky.social, Katherine Metcalf and Yizhe Zhang from my internship at Apple.

🧵