3 DISCUSSING
  • Brandon Hassler   Dec 10 2012

    Great article on the art of scraping. The one issue it failed to bring up is that, depending on the content you are scraping, you could run into duplicate-content problems with Google.

  • Hartley Brody   Dec 10 2012

    It depends on what you actually *do* with the content. I left it up to the imagination of the scraper to decide how to use those methods. Obviously, if you're re-posting other sites' content as-is, you're probably going to have some issues (not just SEO ones), especially if the site that got scraped feels like you're damaging their branding or hurting their revenue stream.

  • Eric Wu   Dec 10 2012

    If you're using Python, as alluded to in the post, I'd suggest taking a look at Scrapy (http://scrapy.org/). It has a great parser and a bunch of other built-in niceties, like the ability to write middlewares, so you can have a full crawler without having to cobble other libraries together to get at what you need.
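
    To give a rough idea of the middleware point above, here is a minimal sketch of a Scrapy downloader middleware. Scrapy treats any plain class with a process_request(request, spider) method as one, so the class itself needs no Scrapy import. The class name, the contact URL, and the module path in the usage note are all hypothetical, made up for illustration.

```python
class IdentifyingUserAgentMiddleware:
    """Sketch of a downloader middleware that labels outgoing requests.

    The name and the default User-Agent string are illustrative
    assumptions, not something from the article or from Scrapy itself.
    """

    def __init__(self, user_agent="example-crawler (+http://example.com/contact)"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Scrapy calls this hook for every request before it is downloaded.
        # setdefault leaves any User-Agent the spider already set untouched.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

    You would then switch it on in the project's settings.py with something like DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.IdentifyingUserAgentMiddleware": 400}, where "myproject.middlewares" is an assumed module path and 400 is just an ordering priority.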
