I am crawling a real estate website, and the idea is:
1. Crawl every single day and store only the differences in the database.
2. When a property is sold, I'll update the DB too.
The challenges are:
1. How do I model the data in the database? I run a scheduler that launches Scrapy every day, and I assume there is no benefit in storing (most likely) the same data over and over; I only need to store the changes relative to what was crawled the first time. If a listing has, for example, a property address, title, price guide, description, and agent name, do I need separate tables for each of these fields to store the historical changes? (There is a rough schema sketch after this list.)
2. How do I merge/insert the new data into the database? When Scrapy runs each day and fetches fresh data, it should update the existing database so that I end up with what I described above: the historical data (diffs/changes) rather than the full data again, which I assume would be a waste of space. (There is an upsert sketch after this list.)
3. Regarding idea #2, the technical challenge I'm facing is that I'm crawling the `buy` category of the site; once a property is sold, it is removed from `buy` and added to the `sold` category. To find the sold properties I have been tracking, I think I would have to loop through all the rows in my database, take each property ID, attach it to a URL, and crawl those pages again to get the new information. How do I model this in the database? What's the appropriate way to track which listings I need to crawl again? (There is a sketch of this at the end, after the other two.)
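Here is a rough sketch of the schema I have in mind for challenge #1, using SQLite purely as an example; the column names (`price_guide`, `agent_name`, etc.) are placeholders for whatever I actually scrape. The idea is one generic change-log table rather than a separate history table per field:

```python
import sqlite3

# Placeholder schema: `properties` holds the current state of each listing,
# `property_changes` stores one row per field that actually changed.
conn = sqlite3.connect("realestate.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS properties (
    property_id  TEXT PRIMARY KEY,   -- the site's own listing ID
    address      TEXT,
    title        TEXT,
    price_guide  TEXT,
    description  TEXT,
    agent_name   TEXT,
    status       TEXT DEFAULT 'buy', -- 'buy' or 'sold'
    first_seen   TEXT,               -- ISO date of the first crawl that saw it
    last_seen    TEXT                -- ISO date of the latest crawl that saw it
);

CREATE TABLE IF NOT EXISTS property_changes (
    property_id  TEXT REFERENCES properties(property_id),
    field_name   TEXT,
    old_value    TEXT,
    new_value    TEXT,
    changed_at   TEXT
);
""")
conn.commit()
```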
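And for challenge #2, this is roughly the merge/insert step I imagine running from a Scrapy pipeline; `item` is a hypothetical dict of scraped fields, and the function assumes the schema sketched above:

```python
from datetime import date

TRACKED_FIELDS = ("address", "title", "price_guide", "description", "agent_name")

def upsert_listing(conn, item):
    """Insert a listing the first time it is seen; afterwards store only changed fields."""
    today = date.today().isoformat()
    row = conn.execute(
        "SELECT address, title, price_guide, description, agent_name "
        "FROM properties WHERE property_id = ?",
        (item["property_id"],),
    ).fetchone()

    if row is None:
        # First crawl of this listing: store the full record once.
        conn.execute(
            "INSERT INTO properties (property_id, address, title, price_guide, "
            "description, agent_name, first_seen, last_seen) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (item["property_id"], *(item.get(f) for f in TRACKED_FIELDS), today, today),
        )
    else:
        # Later crawls: write a change row only for fields whose value differs.
        for field, old_value in zip(TRACKED_FIELDS, row):
            new_value = item.get(field)
            if new_value != old_value:
                conn.execute(
                    "INSERT INTO property_changes (property_id, field_name, old_value, "
                    "new_value, changed_at) VALUES (?, ?, ?, ?, ?)",
                    (item["property_id"], field, old_value, new_value, today),
                )
                conn.execute(
                    f"UPDATE properties SET {field} = ? WHERE property_id = ?",
                    (new_value, item["property_id"]),
                )
        # Mark the listing as still present in today's crawl.
        conn.execute(
            "UPDATE properties SET last_seen = ? WHERE property_id = ?",
            (today, item["property_id"]),
        )
    conn.commit()
```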
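For challenge #3, instead of looping through every row, I think a `last_seen` date plus a `status` column would let me query just the listings that dropped out of the `buy` results. The URL pattern below is made up; the real sold-page URL would depend on the site:

```python
def sold_candidates(conn, today):
    """Listings still marked 'buy' that did not appear in today's crawl
    are the ones likely to have moved to the 'sold' category."""
    return [
        property_id
        for (property_id,) in conn.execute(
            "SELECT property_id FROM properties WHERE status = 'buy' AND last_seen < ?",
            (today,),
        )
    ]

def sold_urls(conn, today):
    # These URLs could feed a second spider (or a second pass of the same one)
    # that scrapes the sold page and then flips `status` to 'sold'.
    for property_id in sold_candidates(conn, today):
        yield f"https://www.example-realestate.com/sold/{property_id}"
```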
Overall these are the biggest challenges. I know I may be heading in a totally wrong direction. There may be other tools or techniques that would help, such as RabbitMQ or Redis, but I'm not sure.
Many thanks!!!