CommunityNews

CommunityNews

What If We Recaption Billions of Web Images with LLaMA-3?

What If We Recaption Billions of Web Images with LLaMA-3?.
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is What If We Recaption Billions of Web Images with LLaMA-3?

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

New
New
CommunityNews
ABSTRACT In lieu of a traditional , I’ve tried to distill the essence of the talk into a collection of maxims: All programmers are API ...
New
First poster: bot
API Gateway Trends behind Features: Apache APISIX 3.0 vs. Kong 3.0 - API7.ai. By comparing the open-source API Gateway Apache APISIX and...
New
First poster: bot
sqlglot/python_sql_engine.md at main · tobymao/sqlglot. Python SQL Parser and Transpiler. Contribute to tobymao/sqlglot development by c...
New
First poster: bot
The cheapest flash microcontroller you can buy is actually an Arm Cortex-M0+ - Jay Carlson. Puya’s 10-cent PY32 series is complicating t...
New
First poster: bot
Hector Martin (@[email protected]). Attached: 1 image For those wondering why the hell we need all this safety system stuff for...
New
First poster: bot
When Zig is safer and faster than Rust. There are endless debates online about Rust vs. Zig, this post explores a side of the argument I...
New
First poster: KnowledgeIsPower
Building a Slack/Discord alternative with Tauri/Rust linen <span class="hashtag-icon-placeholder"></span>blog. Introduction My name is K...
New
First poster: alvinkatojr
Over the last decade, we’ve seen great advancements in distributed systems, but the way we program them has seen few fundamental improvem...
New

Other popular topics Top

New
AstonJ
Curious to know which languages and frameworks you’re all thinking about learning next :upside_down_face: Perhaps if there’s enough peop...
New
New
Exadra37
Oh just spent so much time on this to discover now that RancherOS is in end of life but Rancher is refusing to mark the Github repo as su...
New
Maartz
Hi folks, I don’t know if I saw this here but, here’s a new programming language, called Roc Reminds me a bit of Elm and thus Haskell. ...
New
AstonJ
If you get Can't find emacs in your PATH when trying to install Doom Emacs on your Mac you… just… need to install Emacs first! :lol: bre...
New
First poster: joeb
The File System Access API with Origin Private File System. WebKit supports new API that makes it possible for web apps to create, open,...
New
New
CommunityNews
A Brief Review of the Minisforum V3 AMD Tablet. Update: I have created an awesome-minisforum-v3 GitHub repository to list information fo...
New
AstonJ
This is a very quick guide, you just need to: Download LM Studio: https://lmstudio.ai/ Click on search Type DeepSeek, then select the o...
New
OSZAR »