zepearl
My recommendation, specifically about Amazon: don't crawl it (you'd need direct help from Amazon itself to get their data some other way, e.g. through a direct and consistent data interface).

Reason:

About ~10 years ago I wrote a crawler to discover books and to scan all of each book's ratings (the "stars" given to the book by each reviewer). It was just a hobby project.

At the beginning everything worked well, but after some weeks the layout of the pages and/or the technical IDs behind them started changing inconsistently (some pages were still fine, others were slightly different, others were completely different).

My initial code was (surprisingly, hehe) excellent: nicely structured, with the flow and tasks of each section easy to understand, etc. Then I kept adding if/else conditions to the crawler in multiple places to make it cope with the new layouts/changes, and after a couple of months I could hardly understand any of it. The main problem: I never knew whether I could delete some portions, because 1) the code was garbled and 2) I didn't know whether Amazon would present the old pages again, which would make the old code relevant again.
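One way to keep that kind of layout churn from turning the parser into a pile of if/else branches (a pattern I'd suggest in hindsight, not what I did then) is to register one parse function per detected layout version, so an obsolete layout can be dropped by deleting a single function. A minimal sketch, with hypothetical marker strings standing in for real layout detection:

```python
# Hypothetical sketch: dispatch parsing on a detected layout version
# instead of scattering if/else branches through one monolithic parser.

PARSERS = {}

def parser_for(layout):
    """Register a parse function for one page-layout version."""
    def register(fn):
        PARSERS[layout] = fn
        return fn
    return register

def detect_layout(html):
    # Assumption: each layout has some recognizable marker string.
    if "data-old-grid" in html:
        return "2013-grid"
    if "data-new-flex" in html:
        return "2014-flex"
    return "unknown"

@parser_for("2013-grid")
def parse_grid(html):
    return {"layout": "2013-grid"}  # real field extraction elided

@parser_for("2014-flex")
def parse_flex(html):
    return {"layout": "2014-flex"}  # real field extraction elided

def parse_page(html):
    layout = detect_layout(html)
    parser = PARSERS.get(layout)
    if parser is None:
        raise ValueError(f"no parser for layout {layout!r}")
    return parser(html)
```

The payoff is exactly the "can I delete this?" question above: each layout's code lives in one place, so removing support for a layout is one deletion instead of an archaeology dig.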

Btw. (not directly relevant to the question), the organization of books was (is?) a mess as well:

the same book can be sold under multiple titles (often slightly, sometimes hugely different) and/or author listings (if more than one author wrote it) => ultimate confusion. At the time I fixed that by comparing the names of reviewers and their "stars": if book X shared about 90% of the same reviewers AND the same "stars" as book Y (the data presented by Amazon can slightly change from query to query), then they were mooost probably the same thing, without comparing the title or ISBN at all. From time to time the titles of different versions of the same book were so different that even a human would have been hard pressed to identify them as the same thing. But based on what I saw, Amazon knows very well what-is-what: even a book version that sold 0 copies gets all the reviews of its twin version that previously sold 10,000,000 copies.
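The reviewer-overlap heuristic above can be sketched in a few lines. This is a minimal reconstruction under my own assumptions (dicts of reviewer name to star rating; overlap measured against the smaller review set, since the data "can slightly change from query to query"), not the original code:

```python
def same_book(reviews_x, reviews_y, threshold=0.9):
    """Heuristic duplicate detection: two listings are 'most probably'
    the same book if ~90% of their (reviewer, stars) pairs match.
    reviews_x / reviews_y map reviewer name -> star rating."""
    pairs_x = set(reviews_x.items())
    pairs_y = set(reviews_y.items())
    if not pairs_x or not pairs_y:
        return False
    overlap = len(pairs_x & pairs_y)
    # Compare against the smaller set to tolerate listings that are
    # missing a few reviews on any given query.
    return overlap / min(len(pairs_x), len(pairs_y)) >= threshold
```

Matching on (reviewer, stars) pairs rather than reviewer names alone makes accidental collisions between genuinely different books much less likely.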

An honest "Good luck!" with your startup :o)

FrenchDevRemote
First, list all the category/subcategory URLs for the domain you want to target (you'll probably need to do this for each country); that part is pretty straightforward.

Then find out how the pagination works; usually it's a page number passed as a GET parameter, or a cursor in the URL.
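The two pagination styles loop slightly differently: with a page-number parameter you increment until the last page, while with a cursor you follow whatever "next" token the server hands back. A sketch of both, where `fetch` is a stand-in for a real HTTP call (here it just reads canned responses):

```python
# `fetch` stands in for a real HTTP request; the URLs and response
# shape below are invented for illustration.
RESPONSES = {
    "/search?page=1": {"items": ["a", "b"], "next": "/search?page=2"},
    "/search?page=2": {"items": ["c"], "next": None},
}

def fetch(url):
    return RESPONSES[url]

def crawl_numbered(base="/search?page={}"):
    """GET-parameter style: increment the page number until the end."""
    items, page = [], 1
    while True:
        resp = fetch(base.format(page))
        items.extend(resp["items"])
        if resp["next"] is None:
            break
        page += 1
    return items

def crawl_cursor(start="/search?page=1"):
    """Cursor style: follow the 'next' link the server returns."""
    items, url = [], start
    while url is not None:
        resp = fetch(url)
        items.extend(resp["items"])
        url = resp["next"]
    return items
```

Cursor pagination is the more robust target when the site offers both, since page numbers can shift as listings change under you mid-crawl.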

Some websites limit the number of results returned for a single search, so you'll need to tinker with the faceted search. It's a "simple" nested loop, for example:

    for category in categories:
        for price_range in price_ranges:
            for page in pages(category, price_range):
                get_products_on_the_page(page)
leros
You need to build a crawler: https://en.wikipedia.org/wiki/Web_crawler
visox
Well, I basically built an infinite stream of site consumption: I stored visited and to-be-visited URLs, but this can easily grow too large.
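The "visited and to-be-visited URLs" approach above is the classic BFS crawl frontier. A minimal sketch, with a `max_pages` cap added (my assumption, not part of the original comment) so the stored URLs can't grow without bound; `get_links` stands in for fetching a page and extracting its links:

```python
from collections import deque

def crawl(start, get_links, max_pages=1000):
    """Minimal BFS crawl frontier: a visited set plus a to-be-visited
    queue, capped at max_pages so storage can't grow forever."""
    visited = set()
    frontier = deque([start])
    order = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue  # a URL can be enqueued more than once
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return order
```

For a crawl that really is "an infinite stream", the in-memory set and queue are usually the first things moved to disk (or a database), since the frontier tends to grow much faster than it drains.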
cpach
What does your application do that needs all product data from Amazon…?

I don’t know, but I would guess that Amazon has put in a lot of effort to prevent crawling.

is_true
scrapy