So I am building a data pipeline from Shopify by running a script in order to extract the orders, line items and their respective price, quantity, tax bracket & category (taken from product tags) and cost. This is so we can take the data to visualize into dashboards for different team members. This is a proof of concept for management so we can't use any paid automated tools.
The psuedocode is found below.
We have over 14,000 orders since we started so the script takes several hours to run and I keep facing a ConnectionResetError. I can't help but feel there is a simpler way to extract the data required. All retail prices and costs and inclusive of the sales tax (therefore the profitability report is incorrect as it subtract Net Costs without tax from the costs with tax).
Get orders url.
Loop through the orders in the json dictionary.
For each order:
Loop through each line item
For each line item -->
1) Retrieve the item_id, price and quantity.
2) Use the item_id and variant_id to retrieve a) product tags and b) product inventory_id
3) Use product tags to retrieve the tax bracket and category
4) use inventory_id to retrieve the product cost.
Save everything into a pandas dataframe.
I'm genuinely stumped so any help would be highly appreciated.
Solved! Go to the solution
Multithreading can also run the same function concurrently. You'd break your workload of 14,000 into however many threads the host processor can support (might take some tweaking, but I find that 2-4 is a safe number to start with). Each thread will handle a unique portion of that original workload, run the same function, and collate the results for you at the end.
This is an accepted solution.
You might want to consider doing this in GraphQL. The orders object will have connections to the line items which will have connections to the various inventory and tax details you would need. Then you can perform a bulk query so you don't need to worry about rate limiting.