Converting websites to markdown comes with 3 distinct problems:

1. Thoroughly scraping the content of the page (high recall)

2. Dropping all the ads/auxiliary content (high precision)

3. Getting the correct layout/section types (formatting)

For #2 and #3, Trafilatura, Newspaper4k, and python-readability-based solutions work best out of the box. For #1, any scraping service plus Selenium will do a great job.
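The precision/recall trade-off in #1 and #2 comes down to a density heuristic: keep blocks with lots of text and few links, drop the rest. A minimal stdlib-only sketch of that idea (illustrative only; Trafilatura and readability do far more, including DOM scoring and boilerplate detection):

```python
# Minimal text-density extractor: collect text inside block-level tags,
# then keep only blocks that are long enough and not dominated by links.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "article", "section", "div", "li"}

class DensityExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []       # (text_len, link_text_len, text)
        self._depth = 0        # open block-tag nesting
        self._buf = []
        self._link_chars = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._depth += 1
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth:
            self._flush()
            self._depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)
            if self._in_link:
                self._link_chars += len(data)

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append((len(text), self._link_chars, text))
        self._buf, self._link_chars = [], 0

def extract_main_text(html, min_len=50, max_link_ratio=0.3):
    """Return the concatenated 'content-like' blocks of a page."""
    parser = DensityExtractor()
    parser.feed(html)
    parser._flush()
    return "\n\n".join(
        text for n, links, text in parser.blocks
        if n >= min_len and links / n <= max_link_ratio
    )
```

Raising `min_len` and lowering `max_link_ratio` trades recall for precision, which is exactly the tension between problems #1 and #2.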

Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.

Vercel!! Watch out for your bill now that this is being hugged. Hopefully you're not using <Image /> like they pester you to do.
Great idea to offer image downloads and filtering with GPT!

I built a similar tool last year that doesn't have those features:

Apologies if the UI is slow - you can see some example output on the homepage.

The API it's built on is Urlbox's website screenshot API, which performs far better when used directly. You can request markdown along with JS-rendered HTML, metadata, and a screenshot all in one go:

You can even have it all saved directly to your S3-compatible storage:

And/or delivered by webhook:

I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
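To make the "all in one go" part concrete, here's a rough sketch of composing a render request like the one described above. Urlbox is a real service, but the endpoint shape and parameter names here are assumptions for illustration; verify them against Urlbox's own docs before using.

```python
# Compose a hypothetical render URL: one GET returning markdown (or html,
# png, ...) for the target page. Endpoint and option names are assumed.
from urllib.parse import urlencode

def build_render_url(api_key, target_url, fmt="md", **options):
    """Build a render request URL for the given target page and format."""
    query = urlencode({"url": target_url, **options})
    return f"https://api.urlbox.com/v1/{api_key}/{fmt}?{query}"

request_url = build_render_url(
    "YOUR_PUBLISHABLE_KEY",
    "https://example.com/article",
    fmt="md",
)
```

The same pattern would cover the S3 and webhook delivery mentioned above: you'd pass the storage or callback details as extra query options rather than changing the request shape.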

If you want to scrape whole websites like this, you might also want to check out this new tool by dctanner:

It looks like, if the website presents a cookie message, the tool just gets stuck on it and doesn't parse the actual content. As an example, I tried it and all it created was a markdown of the cookie message and some legalese around it.
I've found htmltidy [1] and pandoc html->markdown sufficiently capable.
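For simple pages, the HTML→Markdown step itself is small enough to sketch in the standard library. A toy converter covering the basics (headings, paragraphs, emphasis, links, list items); pandoc handles nesting, tables, and escaping that this deliberately does not:

```python
# Tiny HTML -> Markdown converter for a handful of common tags.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    conv = MarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()
```

Anything beyond this quickly justifies reaching for pandoc, which is the point of the comment above.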



Pretty cool.

I built something very similar - . Just prepend before any article URL to easily edit, annotate, and share it with anyone.

Also works on ArXiv papers!

This was the Show HN post for Smort -

Just tried it on a complex marketing page and it did a great job. Congrats!

I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?

One of those cases where AI isn't needed. There's a well-established algorithm for extracting content from pages; one implementation:
This is one of those things that the ever-amazing pandoc does very well, on top of supporting virtually every other document format.
This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.
Very cool. We posted about a similar tool we built yesterday

It also crawls (although you can scrape single pages as well)

I also made one like this a while back; you can extract to markdown, HTML, text, or PDF. I found that pages that are just the tool itself are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool could be very useful. Feedback welcome:

These are all "wrappers around readability" AFAIK (including mine) - the Mozilla project that makes sites look clean, which I use often.

A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month). It also has JavaScript-enabled browsing available in private beta; will make it public this week. In my experience, people fall back to simple scraping and don't use JS much, if at all.
Nice! I did something similar a while back, but just for substack :)

One example --


Sanitized output:

Raw markdown:

(Would be happy to open source it if anyone cares!)

Tooltips next to options don't work for me on Firefox for Android 124.0.2, and the converter failed on "" ("invalid URL or server busy"). Love the idea though.
Here is an open source alternative to this tool:
Is the code open source?
I wrote a similar article a while ago:

In my case the purpose was to share saved links to my e-reader and used Markdown as an intermediate format through the mercury.js scraping API, but the possibilities are endless.

I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a codespace.

Anyways, it uses astro + markdown.

It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.

I tried a lot of tools, but none of them works on websites like
Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs and get your multimodal data sources LLM-ready.
Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie wall, not the page I was looking for.

Tried it on another page of the same site; then it only gave me the last article on a six-article page. Some weird things going on.

I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this library in my app.
I've been using gather-cli [0] for this, built by the venerable Brett Terpstra.


I think it was hugged to death
Is this open sourced anywhere by any chance? Are you using GPT to do the conversion, or just doing it yourself by ways of HTML -> Markdown substitutions?
I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!
This was more helpful than I thought it would be for almost the exact same use case, thank you!
Nice. Turn this into a browser extension and I'd install it. Feel like I'd forget about it otherwise.
I’m getting the server overload error but assuming this mostly works I’d use it every day!
I did and it just downloaded a file that said can't be found lol
I use markdownload extension for Firefox. Seems to work pretty ok.
nitpick: the tooltip (triggered by the question-mark icon) doesn't work on mobile - at least on my iPhone (both Chrome & Safari)

May be worth taking a look at.

Good stuff otherwise! Cheers on the launch

"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."
links -dump

elinks -dump

lynx -dump

let me guess, you need more?