Converting websites to markdown comes with 3 distinct problems:

1. Thoroughly scraping the content of the page (high recall)

2. Dropping all the ads/auxiliary content (high precision)

3. Getting the correct layout/section types (formatting)

For #2 and #3, Trafilatura, Newspaper4k, and python-readability-based solutions work best out of the box. For #1, any scraping service plus Selenium will do a great job.
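The precision/recall trade-off in #1 and #2 comes down to a density heuristic: keep blocks with lots of text and few links, drop the rest. A minimal stdlib-only sketch of that idea (illustrative only; Trafilatura and readability do far more, including DOM scoring and boilerplate detection):

```python
# Minimal text-density extractor: collect text inside block-level tags,
# then keep only blocks that are long enough and not dominated by links.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "article", "section", "div", "li"}

class DensityExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []       # (text_len, link_text_len, text)
        self._depth = 0        # open block-tag nesting
        self._buf = []
        self._link_chars = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._depth += 1
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth:
            self._flush()
            self._depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)
            if self._in_link:
                self._link_chars += len(data)

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append((len(text), self._link_chars, text))
        self._buf, self._link_chars = [], 0

def extract_main_text(html, min_len=50, max_link_ratio=0.3):
    """Return the concatenated 'content-like' blocks of a page."""
    parser = DensityExtractor()
    parser.feed(html)
    parser._flush()
    return "\n\n".join(
        text for n, links, text in parser.blocks
        if n >= min_len and links / n <= max_link_ratio
    )
```

Raising `min_len` and lowering `max_link_ratio` trades recall for precision, which is exactly the tension between problems #1 and #2.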

Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.

Vercel!! Watch out for your bill now that this is being hugged. Hopefully you're not using <Image /> like they pester you to do.
Great idea to offer image downloads and filtering with GPT!

I built a similar tool last year that doesn't have those features:

Apologies if the UI is slow - you can see some example output on the homepage.

The API it's built on is Urlbox's website screenshot API, which performs far better when used directly. You can request markdown along with JS-rendered HTML, metadata, and a screenshot all in one go:

You can even have it all saved directly to your S3-compatible storage:

And/or delivered by webhook:

I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
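To make the "all in one go" part concrete, here's a rough sketch of composing a render request like the one described above. Urlbox is a real service, but the endpoint shape and parameter names here are assumptions for illustration; verify them against Urlbox's own docs before using.

```python
# Compose a hypothetical render URL: one GET returning markdown (or html,
# png, ...) for the target page. Endpoint and option names are assumed.
from urllib.parse import urlencode

def build_render_url(api_key, target_url, fmt="md", **options):
    """Build a render request URL for the given target page and format."""
    query = urlencode({"url": target_url, **options})
    return f"https://api.urlbox.com/v1/{api_key}/{fmt}?{query}"

request_url = build_render_url(
    "YOUR_PUBLISHABLE_KEY",
    "https://example.com/article",
    fmt="md",
)
```

The same pattern would cover the S3 and webhook delivery mentioned above: you'd pass the storage or callback details as extra query options rather than changing the request shape.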

If you want to scrape whole websites like this, you might also want to check out this new tool by dctanner:

It looks like, if the website presents a cookie message, the tool just gets stuck on it and doesn't parse the actual content. As an example, I tried it and all it created was a markdown of the cookie message and some legalese around it.
I've found htmltidy [1] and pandoc html->markdown sufficiently capable.
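For simple pages, the HTML→Markdown step itself is small enough to sketch in the standard library. A toy converter covering the basics (headings, paragraphs, emphasis, links, list items); pandoc handles nesting, tables, and escaping that this deliberately does not:

```python
# Tiny HTML -> Markdown converter for a handful of common tags.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    conv = MarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()
```

Anything beyond this quickly justifies reaching for pandoc, which is the point of the comment above.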



Pretty cool.

I built something very similar - . Just prepend before any article URL to easily edit, annotate, and share it with anyone.

Also works on ArXiv papers!

This was the Show HN post for Smort -

Just tried it on a complex marketing page and it did a great job. Congrats!

I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?

One of those cases where AI isn't needed. There's a well-established algorithm for extracting content from pages; one implementation:
This is one of those things that the ever-amazing pandoc does very well, on top of supporting virtually every other document format.
This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.
Very cool. We posted about a similar tool we built yesterday

It also crawls (although you can scrape single pages as well)

I also made one like this a while back; you can extract to markdown, HTML, text, or PDF. I found that pages that are just the tool itself are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool could be very useful. Feedback welcome:

These are all "wrappers around readability" AFAIK (including mine) - the Mozilla project that makes sites look clean, which I use often.

A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month). It also has JavaScript-enabled browsing available in private beta; will make it public this week. In my experience, people fall back to simple scraping and don't use JS much, if at all.
Nice! I did something similar a while back, but just for substack :)

One example --


Sanitized output:

Raw markdown:

(Would be happy to open source it if anyone cares!)

Tooltips next to options don't work for me on Firefox for Android 124.0.2, and the converter failed on "" ("invalid URL or server busy"). Love the idea though.
Here is an open source alternative to this tool:
Is the code open source?
I wrote a similar article a while ago:

In my case the purpose was to share saved links to my e-reader and used Markdown as an intermediate format through the mercury.js scraping API, but the possibilities are endless.

I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace.

Anyways, it uses astro + markdown.

It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.

I tried a lot of tools, but none of them works on websites like
Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs and get your multimodal data sources LLM-ready.
Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie wall, not the page I was looking for.

Tried it on another page of the same site; then it only gave me the last article on a six-article page. Some weird things going on.

I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this library in my app.
I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.


I think it was hugged to death
Is this open sourced anywhere by any chance? Are you using GPT to do the conversion, or just doing it yourself by ways of HTML -> Markdown substitutions?
I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!
This was more helpful than I thought it would be for almost the exact same use case, thank you!
Nice. Turn this into a browser extension and I'd install it. Feel like I'd forget about it otherwise.
I’m getting the server overload error but assuming this mostly works I’d use it every day!
I did and it just downloaded a file that said can't be found lol
I use markdownload extension for Firefox. Seems to work pretty ok.
nitpick: the tooltip (triggered by the question-mark icon) doesn't work on mobile - at least on my iPhone (both Chrome & Safari)

May be worth taking a look at.

Good stuff otherwise! Cheers on the launch

"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."
links -dump

elinks -dump

lynx -dump

let me guess, you need more?