Monday, April 3, 2017

Better process to scrape huge amounts of data simultaneously

I am building a web app that follows this process.

1) User registers

2) After the user registers, I run a queued job that scrapes 60k+ customer records. The data comes from a 3rd party API, and I use cURL to fetch it.

3) After scraping the data, I store it in the database.

4) The data from the 3rd party API is paginated, so I check the API response for another page (nextPageUrl). If it has one, I cURL again, fetch that page's customer data, and store it as well. This continues until the response has no nextPageUrl.

// this is pseudocode

RegisterUser($user);
CallThirdPartyAPI();

function RegisterUser($user) {
    insert_in_users_table($user);
}

function CallThirdPartyAPI($url = null) {
    // fetch one page of customers from the 3rd party API via cURL
    $response = get_all_customers($url);

    foreach ($response->customers as $cust) {
        store_in_customers_table($cust);
    }

    // keep following nextPageUrl until the API stops returning one
    if ($response->nextPageUrl) {
        CallThirdPartyAPI($response->nextPageUrl);
    }
}
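
For context, here is a minimal sketch of what a get_all_customers() helper might look like with PHP's cURL extension. The endpoint URL and the response field names (customers, nextPageUrl) are assumptions for illustration; the real 3rd party API will differ.

function get_all_customers($url = null)
{
    // hypothetical endpoint; the real 3rd party API URL goes here
    $url = $url ?: 'https://api.example.com/customers';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // fail instead of hanging forever

    $body = curl_exec($ch);
    curl_close($ch);

    // assumes the API answers with JSON shaped like:
    // {"customers": [...], "nextPageUrl": "https://..."}
    return json_decode($body);
}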

Now as you can see, this is fine if only 1 user at a time registers in my web app. But with 100+ users registering, it becomes a problem: scraping takes 20-30 minutes per user, and my job queue runs only 2 jobs at a time, so those 2 jobs need to finish before the next ones can execute. Back of the envelope: at ~25 minutes per job and 2 workers, 100 registrations would take roughly 100 / 2 × 25 min ≈ 21 hours to clear.

Now, I am looking for a better solution that would make the system more efficient.

Your suggestions will be greatly appreciated.

PS:

I run the job queue through Supervisor.
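
For illustration, the Supervisor program block behind that setup might look roughly like this. The program name and command are placeholders; the actual worker command depends on the queue library.

[program:scrape-worker]
; placeholder command; replace with your queue library's worker command
command=php /var/www/app/worker.php
; 2 worker processes = only 2 jobs running at a time
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true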

I have a read replica set up for my database. I write to the master DB and read from the replica to lessen the CPU usage of my DB.
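
As a minimal sketch of that read/write split, assuming PDO with MySQL (hostnames, credentials, and table columns are placeholders):

$master  = new PDO('mysql:host=master.db.internal;dbname=app', 'user', 'pass');
$replica = new PDO('mysql:host=replica.db.internal;dbname=app', 'user', 'pass');

// writes (e.g. storing scraped customers) go to the master
$stmt = $master->prepare('INSERT INTO customers (name, email) VALUES (?, ?)');
$stmt->execute([$cust->name, $cust->email]);

// reads go to the replica to keep load off the master
$users = $replica->query('SELECT * FROM users')->fetchAll();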



via PinoyStackOverflower
