Monday, April 3, 2017

Better process to scrape huge amounts of data simultaneously

I am building a web app that follows this process.

1) User registers

2) After the user registers, I run a queued job that scrapes 60k+ customer records. The data comes from a 3rd party API, and I use cURL to fetch it.

3) After scraping the data, I store it in the database.

4) The data from the 3rd party API is paginated, so I check the API response for another page (nextPageUrl). If it has one, I cURL again, fetch that page's customer data, and store it as well. This continues until the response has no nextPageUrl.

// this is pseudocode

RegisterUser($user);
CallThirdPartyAPI();

function RegisterUser($user) {
    insert_in_users_table($user);
}

function CallThirdPartyAPI($url = null) {
    // fetch one page of customers from the 3rd party API via cURL
    $response = get_all_customers($url);

    foreach ($response->customers as $cust) {
        store_in_customers_table($cust);
    }

    // keep following nextPageUrl until the API stops returning one
    if ($response->nextPageUrl) {
        CallThirdPartyAPI($response->nextPageUrl);
    }
}
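
For context, here is a minimal sketch of what a get_all_customers() helper might look like with PHP's cURL extension. The endpoint URL and the response field names (customers, nextPageUrl) are assumptions for illustration; the real 3rd party API will differ.

function get_all_customers($url = null)
{
    // hypothetical endpoint; the real 3rd party API URL goes here
    $url = $url ?: 'https://api.example.com/customers';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // fail instead of hanging forever

    $body = curl_exec($ch);
    curl_close($ch);

    // assumes the API answers with JSON shaped like:
    // {"customers": [...], "nextPageUrl": "https://..."}
    return json_decode($body);
}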

Now as you can see, this is fine if only 1 user at a time registers in my web app. But with 100+ users registering, it becomes a problem: scraping takes 20-30 minutes per user, and my job queue runs only 2 jobs at a time, so those 2 jobs need to finish before the next ones can execute. Back of the envelope: at ~25 minutes per job and 2 workers, 100 registrations would take roughly 100 / 2 × 25 min ≈ 21 hours to clear.

Now, I am looking for a better solution that would make the system more efficient.

Your suggestions will be greatly appreciated.

PS:

I run the job queue through Supervisor.
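
For illustration, the Supervisor program block behind that setup might look roughly like this. The program name and command are placeholders; the actual worker command depends on the queue library.

[program:scrape-worker]
; placeholder command; replace with your queue library's worker command
command=php /var/www/app/worker.php
; 2 worker processes = only 2 jobs running at a time
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true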

I have a read replica set up for my database. I write to the master DB and read from the replica to lessen the CPU usage of my DB.
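
As a minimal sketch of that read/write split, assuming PDO with MySQL (hostnames, credentials, and table columns are placeholders):

$master  = new PDO('mysql:host=master.db.internal;dbname=app', 'user', 'pass');
$replica = new PDO('mysql:host=replica.db.internal;dbname=app', 'user', 'pass');

// writes (e.g. storing scraped customers) go to the master
$stmt = $master->prepare('INSERT INTO customers (name, email) VALUES (?, ?)');
$stmt->execute([$cust->name, $cust->email]);

// reads go to the replica to keep load off the master
$users = $replica->query('SELECT * FROM users')->fetchAll();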



via PinoyStackOverflower
