Converting websites to markdown comes with 3 distinct problems:

1. Thoroughly scraping the content of the page (high recall)

2. Dropping all the ads/auxiliary content (high precision)

3. Getting the correct layout/section types (formatting)

For #2 and #3, Trafilatura, Newspaper4k, and python-readability-based solutions work best out of the box. For #1, any scraping service plus Selenium will do a great job.
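The precision/recall trade-off in #1 and #2 comes down to a density heuristic: keep blocks with lots of text and few links, drop the rest. A minimal stdlib-only sketch of that idea (illustrative only; Trafilatura and readability do far more, including DOM scoring and boilerplate detection):

```python
# Minimal text-density extractor: collect text inside block-level tags,
# then keep only blocks that are long enough and not dominated by links.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "article", "section", "div", "li"}

class DensityExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []       # (text_len, link_text_len, text)
        self._depth = 0        # open block-tag nesting
        self._buf = []
        self._link_chars = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._depth += 1
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth:
            self._flush()
            self._depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)
            if self._in_link:
                self._link_chars += len(data)

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append((len(text), self._link_chars, text))
        self._buf, self._link_chars = [], 0

def extract_main_text(html, min_len=50, max_link_ratio=0.3):
    """Return the concatenated 'content-like' blocks of a page."""
    parser = DensityExtractor()
    parser.feed(html)
    parser._flush()
    return "\n\n".join(
        text for n, links, text in parser.blocks
        if n >= min_len and links / n <= max_link_ratio
    )
```

Raising `min_len` and lowering `max_link_ratio` trades recall for precision, which is exactly the tension between problems #1 and #2.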

Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.

Vercel!! Watch out for your bill now that this is being hugged. Hopefully you're not using <Image /> like they pester you to do.
Great idea to offer image downloads and filtering with GPT!

I built a similar tool last year that doesn't have those features:

Apologies if the UI is slow - you can see some example output on the homepage.

The API it's built on is Urlbox's website screenshot API, which performs far better when used directly. You can request markdown along with JS-rendered HTML, metadata, and a screenshot all in one go:

You can even have it all saved directly to your S3-compatible storage:

And/or delivered by webhook:

I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
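To make the "all in one go" part concrete, here's a rough sketch of composing a render request like the one described above. Urlbox is a real service, but the endpoint shape and parameter names here are assumptions for illustration; verify them against Urlbox's own docs before using.

```python
# Compose a hypothetical render URL: one GET returning markdown (or html,
# png, ...) for the target page. Endpoint and option names are assumed.
from urllib.parse import urlencode

def build_render_url(api_key, target_url, fmt="md", **options):
    """Build a render request URL for the given target page and format."""
    query = urlencode({"url": target_url, **options})
    return f"https://api.urlbox.com/v1/{api_key}/{fmt}?{query}"

request_url = build_render_url(
    "YOUR_PUBLISHABLE_KEY",
    "https://example.com/article",
    fmt="md",
)
```

The same pattern would cover the S3 and webhook delivery mentioned above: you'd pass the storage or callback details as extra query options rather than changing the request shape.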

If you want to scrape whole websites like this, you might also want to check out this new tool by dctanner:

It looks like, if the website presents a cookie message, the tool just gets stuck on it and doesn't parse the actual content. As an example, I tried it and all it created was a markdown of the cookie message and some legalese around it.
I've found htmltidy [1] and pandoc html->markdown sufficiently capable.
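For simple pages, the HTML→Markdown step itself is small enough to sketch in the standard library. A toy converter covering the basics (headings, paragraphs, emphasis, links, list items); pandoc handles nesting, tables, and escaping that this deliberately does not:

```python
# Tiny HTML -> Markdown converter for a handful of common tags.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    conv = MarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()
```

Anything beyond this quickly justifies reaching for pandoc, which is the point of the comment above.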



Pretty cool.

I built something very similar - . Just prepend before any article URL to easily edit, annotate, and share it with anyone.

Also works on ArXiv papers!

This was the Show HN post for Smort -

Just tried it on a complex marketing page and it did a great job. Congrats!

I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?

One of those cases where AI isn't needed. There's a well-established algorithm for extracting content from pages; one implementation:
This is one of those things that the ever-amazing pandoc does very well, on top of supporting virtually every other document format.
This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.
Very cool. We posted about a similar tool we built yesterday

It also crawls (although you can scrape single pages as well)

I also made one like this a while back; you can extract to markdown, HTML, text, or PDF. I found that pages that are just the tool itself are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool could be very useful. Feedback welcome:

These are all "wrappers around readability" AFAIK (including mine) - the Mozilla project that makes sites look clean, which I use often.

A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month). It also has JavaScript-enabled browsing available in private beta; will make it public this week. In my experience, people fall back to simple scraping and don't use JS much, if at all.
Nice! I did something similar a while back, but just for substack :)

One example --


Sanitized output:

Raw markdown:

(Would be happy to open source it if anyone cares!)

Tooltips next to options don't work for me on Firefox for Android 124.0.2, and the converter failed on "" ("invalid URL or server busy"). Love the idea though.
Here is an open source alternative to this tool:
Is the code open source?
I wrote a similar article a while ago:

In my case the purpose was to share saved links to my e-reader and used Markdown as an intermediate format through the mercury.js scraping API, but the possibilities are endless.

I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace.

Anyways, it uses astro + markdown.

It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.

I tried a lot of tools, but none of them works on websites like
Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs and get your multimodal data sources LLM-ready.
Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie wall, not the page I was looking for.

Tried it on another page of the same site; then it only gave me the last article on a six-article page. Some weird things going on.

I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this library in my app.
I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.


I think it was hugged to death
Is this open sourced anywhere by any chance? Are you using GPT to do the conversion, or just doing it yourself by ways of HTML -> Markdown substitutions?
I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!
This was more helpful than I thought it would be for almost the exact same use case, thank you!
Nice. Turn this into a browser extension and I'd install it. Feel like I'd forget about it otherwise.
I’m getting the server overload error but assuming this mostly works I’d use it every day!
I did and it just downloaded a file that said can't be found lol
I use markdownload extension for Firefox. Seems to work pretty ok.
nitpick: the tooltip (triggered by the question-mark icon) doesn't work on mobile - at least on my iPhone (both Chrome & Safari)

May be worth taking a look at.

Good stuff otherwise! Cheers on the launch

"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."
links -dump

elinks -dump

lynx -dump

let me guess, you need more?