Mikolaj Czerkawski

mikonvergence

AI & ML interests

None yet

Recent Activity

updated a Space about 1 month ago
ESA-philab/README

Organizations

Gradio-Themes-Party, satellite-image-deep-learning, European Space Agency Φ-lab, Major TOM, Hugging Face Discord Community

mikonvergence's activity

reacted to mkluczek's post with 🚀🔥 3 months ago
First Global and Dense Open Embedding Dataset of Earth! 🌍 🤗

Introducing the Major TOM embeddings dataset, created in collaboration with CloudFerro S.A. 🔶 and Φ-lab at the European Space Agency (ESA) 🛰️. Together with @mikonvergence and Jędrzej S. Bojanowski, we present the first open-access dataset of Copernicus embeddings, offering dense, global coverage across the full acquisition areas of Sentinel-1 and Sentinel-2 sensors.

💡 Highlights:
📊 Data: Over 8 million Sentinel-1 & Sentinel-2 images processed, distilling insights from 9.368 trillion pixels of raw data.
🧠 Models: Foundation models include SigLIP, DINOv2, and SSL4EO.
📦 Scale: 62 TB of raw satellite data processed into 170M+ embeddings.

This project delivers open and free vectorized expansions of the Major TOM datasets, setting a new standard for embedding releases and enabling lightweight, scalable ingestion of Earth Observation (EO) data for countless applications.

🤗 Explore the datasets:
Major-TOM/Core-S2L1C-SSL4EO
Major-TOM/Core-S1RTC-SSL4EO
Major-TOM/Core-S2RGB-DINOv2
Major-TOM/Core-S2RGB-SigLIP

📖 Check paper: Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space (2412.05600)
💻 Code notebook: https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb
posted an update 6 months ago
𝐍𝐞𝐰 𝐑𝐞𝐥𝐞𝐚𝐬𝐞: 𝐌𝐚𝐣𝐨𝐫 𝐓𝐎𝐌 𝐃𝐢𝐠𝐢𝐭𝐚𝐥 𝐄𝐥𝐞𝐯𝐚𝐭𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥 𝐄𝐱𝐩𝐚𝐧𝐬𝐢𝐨𝐧 🗺️

Dataset: Major-TOM/Core-DEM

Today with European Space Agency - ESA and Adobe Research, we release a global expansion to Major TOM with GLO-30 DEM data.

You can now instantly access nearly 2M Major TOM samples with elevation data to build your next AI model for EO. 🌍

🔍 Browse the data in our usual viewer app: Major-TOM/MajorTOM-Core-Viewer

Fantastic work championed by Paul Borne--Pons @NewtNewt 🚀
posted an update 12 months ago
𝗠𝗮𝗷𝗼𝗿 𝗧𝗢𝗠: 𝗣𝗹𝗮𝗻𝗲𝘁 𝗘𝗮𝗿𝘁𝗵 𝗶𝘀 𝗯̶𝗹̶𝘂̶𝗲̶ 𝟱.𝟰𝟬𝟱 𝗚𝗛𝘇

🚨 EXPANSION RELEASE: 𝗦𝗲𝗻𝘁𝗶𝗻𝗲𝗹-𝟭 𝗶𝘀 𝗻𝗼𝘄 𝗮𝘃𝗮𝗶𝗹𝗮𝗯𝗹𝗲 in the MajorTOM-Core!
Major-TOM/Core-S1RTC

🏎 Together with @aliFrancis we've been racing to release the first official expansion to the Major TOM project.

MajorTOM-Core-S1RTC contains 1,469,955 SAR images paired with Sentinel-2 images from Core-S2.

🔍We cover more than 65% of the optical coverage with an average time shift of 7 days.
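As a rough sanity check on the coverage figure, the sample counts quoted in these posts can be compared directly. This is only a sketch: it assumes coverage is measured by sample count (the post may instead mean area), and it takes the Core-S2 patch count of 2,245,886 from the Core-S2 release announcement. The variable names are illustrative.

```python
# Sample counts quoted in the Major TOM posts
S1_SAMPLES = 1_469_955  # SAR images in Core-S1RTC (this post)
S2_SAMPLES = 2_245_886  # Sentinel-2 patches in Core-S2 (release announcement)

# Assumption: "coverage" is approximated as the share of Core-S2
# patches that have a paired Sentinel-1 image.
paired_fraction = S1_SAMPLES / S2_SAMPLES

print(f"Paired share of Core-S2 patches: {paired_fraction:.1%}")
```

Under that assumption the ratio lands just above 65%, in line with the claim.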

16 TB of radiometrically calibrated SAR imagery, available in the exact same format as the existing Major-TOM data.

🗺️ You can explore instantly in our viewing app:
Major-TOM/MajorTOM-Core-Viewer

So, what now?

🧱 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐭𝐲 𝐆𝐫𝐨𝐰𝐭𝐡: our community continues to grow! To coordinate the upcoming expansions as well as use cases of the open data, we will organise a meetup on 23 April. You can 𝐫𝐞𝐠𝐢𝐬𝐭𝐞𝐫 𝐲𝐨𝐮𝐫 𝐢𝐧𝐭𝐞𝐫𝐞𝐬𝐭 here: https://forms.gle/eBj8JvibJx9b6PLf9

🚂 𝐎𝐩𝐞𝐧 𝐃𝐚𝐭𝐚 𝐟𝐨𝐫 𝐎𝐩𝐞𝐧 𝐌𝐨𝐝𝐞𝐥𝐬: The Major-TOM Core dataset is currently supporting several strands of ongoing research within and outwith our lab, and we are looking forward to the time when we can release models that take advantage of that data! https://huggingface.co/Major-TOM

📌 𝐏𝐨𝐬𝐭𝐞𝐫 𝐚𝐭 𝐈𝐆𝐀𝐑𝐒𝐒: We will present the Major TOM project as a poster at IGARSS in Athens (July) - come talk to us if you're there! You can access the paper here: Major TOM: Expandable Datasets for Earth Observation (2402.12095)


🌌 Developed at European Space Agency Φ-lab in partnership with Hugging Face
reacted to osanseviero's post with 🔥 12 months ago
Diaries of Open Source. Part 5!

🤯Contextual KTO Mistral PairRM: this model combines iterative KTO, SnorkelAI DPO dataset, Allenai PairRM for ranking, Mistral for the base model, and is a very strong model with Claude 3 quality on AlpacaEval 2.0
Final model: ContextualAI/Contextual_KTO_Mistral_PairRM
Dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
Leaderboard: https://tatsu-lab.github.io/alpaca_eval/
Base model: mistralai/Mistral-7B-Instruct-v0.2

🤏 tinyBenchmarks: Quick and cheap LLM evaluation!
Code: https://github.com/felipemaiapolo/tinyBenchmarks
Paper: tinyBenchmarks: evaluating LLMs with fewer examples (2402.14992)
Data: tinyBenchmarks/tinyMMLU

🎨Transformers.js 2.16 includes StableLM, speaker verification and diarization, and better chat templating. Try some fun demos!
- Xenova/video-object-detection
- Xenova/cross-encoder-web
- Xenova/the-tokenizer-playground

🏴‍☠️ Abacus Liberated-Qwen1.5-72B, a Qwen 72B-based model that strongly follows system prompts
Model: abacusai/Liberated-Qwen1.5-72B

👀Design2Code: benchmark of webpage screenshots to code
Data: SALT-NLP/Design2Code
Project: https://salt-nlp.github.io/Design2Code/
Paper: Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)

🌎Data and models around the world
- One of the biggest Italian datasets https://hf.co/datasets/manalog/UsenetArchiveIT
- IndicLLMSuite: largest pre-training and instruction fine-tuning dataset collection across 22 Indic languages ai4bharat/indicllmsuite-65ee7d225c337fcfa0991707
- Hebrew-Gemma-11B, the best base Hebrew model yam-peleg/Hebrew-Gemma-11B
- Komodo-7B, a family of LLMs for multiple Indonesian languages Yellow-AI-NLP/komodo-7b-base

You can find the previous part at https://huggingface.co/posts/osanseviero/127895284909100
reacted to osanseviero's post with 👍 about 1 year ago
Diaries of Open Source. Part 3! OS goes to the moon!

💻 OpenCodeInterpreter, a family of very powerful code generation models
Models: m-a-p/opencodeinterpreter-65d312f6f88da990a64da456
Paper: OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement (2402.14658)
Demo: m-a-p/OpenCodeInterpreter_demo

🔷🔶Zephyr 7B Gemma, Gemma fine-tuned with the Zephyr recipe
Model: HuggingFaceH4/zephyr-7b-gemma-v0.1
Demo: HuggingFaceH4/zephyr-7b-gemma-chat
GH Repo: https://github.com/huggingface/alignment-handbook

🪆The MixedBread folks released a 2D Matryoshka text embedding model, which means you can dynamically change the embedding size and layer counts
Model: mixedbread-ai/mxbai-embed-2d-large-v1
Release blog post: https://www.mixedbread.ai/blog/mxbai-embed-2d-large-v1

🐋Microsoft released Orca Math, which includes 200K grade school math problems
Dataset: microsoft/orca-math-word-problems-200k

🥷IBM silently released Merlinite, a cool model trained on Mixtral-generated synthetic data using a novel LAB method https://huggingface.co/ibm/merlinite-7b

🌚 Moondream2 - a small vision language model to run on-device!
Model: vikhyatk/moondream2
Demo: vikhyatk/moondream2

🏙️CityDreamer: 3D City Generation
Demo: hzxie/city-dreamer
Repo: https://github.com/hzxie/city-dreamer
Model: hzxie/city-dreamer

🌏ML in all languages
Sailor, a family of South-East Asian language models sail/sailor-language-models-65e19a749f978976f1959825
Samvaad dataset, which includes 140k QA pairs in Hindi, Bengali, Marathi, Tamil, Telugu, Oriya, Punjabi, and Gujarati GenVRadmin/Samvaad-Mixed-Language-2

You can see the previous part at https://huggingface.co/posts/osanseviero/674644082063278
replied to robmarkcole's post about 1 year ago
reacted to robmarkcole's post with 🤗 about 1 year ago
reacted to osanseviero's post with 👍 about 1 year ago
Diaries of Open Source. Part 2. Open Source is going brrrrr

🚀The European Space Agency releases MajorTOM, a dataset of earth observation covering half the earth. The dataset has 2.5 trillion pixels! Congrats @aliFrancis and @mikonvergence !
Dataset: Major-TOM/Core-S2L2A
Viewer: Major-TOM/MajorTOM-Core-Viewer

🍞Re-ranking models by MixedBreadAI, with very high quality, Apache 2 license, and easy to use!
Models: https://huggingface.co/models?other=reranker&sort=trending&search=mixedbread-ai
Blog: https://www.mixedbread.ai/blog/mxbai-rerank-v1

🧊StabilityAI and TripoAI release TripoSR, a super-fast MIT-licensed image-to-3D model!
Model: stabilityai/TripoSR
Demo: stabilityai/TripoSR

🤝Together AI and HazyResearch release Based
Models and datasets: hazyresearch/based-65d77fb76f9c813c8b94339c
GH repo: https://github.com/HazyResearch/based

🌊LaVague: an open-source pipeline to turn natural language into browser actions! It can run locally with HuggingFaceH4/zephyr-7b-gemma-v0.1
Read more about it at https://huggingface.co/posts/dhuynh95/717319217106504

🏆Berkeley Function-Calling Leaderboard
Read about it: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html

🐬Sailor-Chat: chat models built on top of OpenOrca and @sarahooker CohereForAI Aya project. They can be used for South-East Asia languages such as Indonesian, Thai, Vietnamese, Malay and Lao!
Models: sail/sailor-language-models-65e19a749f978976f1959825
Demo: https://huggingface.co/spaces/sail/Sailor-7B-Chat

🤗Arabic-OpenHermes-2.5: OpenHermes dataset translated to Arabic 2A2I/Arabic-OpenHermes-2.5

See the previous part here https://huggingface.co/posts/osanseviero/622788932781684
reacted to robmarkcole's post with ❤️ about 1 year ago
reacted to aliFrancis's post with 🤗 about 1 year ago
🗺 Major TOM: Expandable Datasets for Earth Observation

🚨 RECORD-BREAKING EO DATASET: the largest ever ML-ready Sentinel-2 dataset! It covers almost every single point on Earth captured by the Copernicus Sentinel-2 satellite. @mikonvergence and I are thrilled to finally announce the release of Major-TOM/Core-S2L2A and Major-TOM/Core-S2L1C

🌍 About half of the entire planet is covered. That's 2,245,886 patches of 1068 x 1068 pixels, available in both L1C and L2A. At 10 m resolution, we've got 256 million square km with over 2.5 trillion pixels. It's all yours with a few lines of code. See the paper linked below 🔽 for more info!
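The headline numbers above can be reproduced with a few lines of arithmetic. The sketch below treats every patch as a full 1068 x 1068 grid of 10 m pixels and ignores any overlap between neighbouring patches, so the area is a nominal figure. The Earth surface area of roughly 510 million square km is an assumption added here for the comparison, not a value from the post.

```python
# Figures quoted in the post
NUM_PATCHES = 2_245_886
PATCH_SIDE_PX = 1068
PIXEL_SIZE_M = 10  # 10 m ground resolution

# Assumption (not from the post): approximate total surface area of Earth
EARTH_SURFACE_KM2 = 510e6

# Total pixel count across all patches
total_pixels = NUM_PATCHES * PATCH_SIDE_PX ** 2

# Each pixel covers (10 m)^2 = 1e-4 km^2; overlap between patches is ignored
pixel_area_km2 = (PIXEL_SIZE_M / 1000) ** 2
total_area_km2 = total_pixels * pixel_area_km2

fraction_of_earth = total_area_km2 / EARTH_SURFACE_KM2

print(f"{total_pixels / 1e12:.2f} trillion pixels")
print(f"{total_area_km2 / 1e6:.0f} million square km")
print(f"{fraction_of_earth:.0%} of Earth's surface (nominal)")
```

The result is consistent with the post: a little over 2.5 trillion pixels, about 256 million square km, and roughly half of the planet.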

🧱 And this is just the beginning. We are currently preparing more datasets from different satellites for the Major TOM org. TOM stands for Terrestrial Observation Metaset - a simple set of rules for building an ecosystem of ML-ready EO datasets, which can be seamlessly combined as if they were Lego bricks.

🚴‍♀️ Want to take the dataset for a spin? We have a viewer app on Spaces that lets you go anywhere on Earth and shows you the data, if it's available: Major-TOM/MajorTOM-Core-Viewer

📰 Preprint paper: Major TOM: Expandable Datasets for Earth Observation (2402.12095)
💻 Colab example: https://colab.research.google.com/github/ESA-PhiLab/Major-TOM/blob/main/03-Filtering-in-Colab.ipynb

Thank you to the amazing 🤗 Hugging Face team for the support on this one! @osanseviero @lhoestq @BrigitteTousi