python: read files in parallel and merge them
Given an iterable of CSV files, read them in parallel with pandas and concatenate the results into a single DataFrame.
```python
import pandas as pd
from multiprocessing import Pool
from pathlib import Path

folderPath = Path('data')  # placeholder: the folder containing the CSV files
files = folderPath.glob('*.csv')
with Pool() as pool:
    df = pd.concat(pool.map(pd.read_csv, files))
```
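One caveat: under the spawn start method (the default on Windows and macOS), each worker re-imports the script, so the pool must be created under a `__main__` guard. A minimal sketch, keeping the placeholder folder name from above:

```python
import pandas as pd
from multiprocessing import Pool
from pathlib import Path

def main():
    files = Path('data').glob('*.csv')  # placeholder folder
    with Pool() as pool:
        df = pd.concat(pool.map(pd.read_csv, files))
    print(df.shape)

if __name__ == '__main__':
    main()  # required under the spawn start method (Windows, macOS)
```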
To see progress, use tqdm. Note that `Pool.map` consumes its input eagerly, so wrapping the input files would make the bar jump to 100% immediately; wrap the lazily yielded results of `Pool.imap` instead.
```python
from tqdm import tqdm

files = list(folderPath.glob('*.csv'))
with Pool() as pool:
    df = pd.concat(tqdm(pool.imap(pd.read_csv, files), total=len(files)))
```
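If every file needs the same `pd.read_csv` options (separator, dtypes, column subset), `functools.partial` produces a picklable callable that works with `pool.map`; the option values here are illustrative:

```python
from functools import partial

# illustrative options; adjust to your files
read = partial(pd.read_csv, sep=';', usecols=['id', 'value'])
with Pool() as pool:
    df = pd.concat(pool.map(read, files))
```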
You can use ThreadPoolExecutor to achieve the same thing. Its `map` returns results lazily, so wrapping the output in tqdm shows real progress:
```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    df = pd.concat(tqdm(pool.map(pd.read_csv, files), total=len(files)))
```
- ThreadPoolExecutor: uses threads for parallel execution. Threads share the same memory and are lightweight compared to processes. Best for I/O-bound work: file I/O, network operations, waiting on events.
- multiprocessing.Pool: uses processes for parallel execution. Each process has its own memory, so inter-process communication is more costly than thread communication. Best for CPU-bound tasks that benefit from multiple CPU cores: heavy computation.
| | ThreadPoolExecutor | multiprocessing.Pool |
|---|---|---|
| Execution | Utilizes threads for parallel execution. | Utilizes processes for parallel execution. |
| Best for | Ideal for I/O-bound tasks. | Best suited for CPU-bound tasks. |
| GIL | Affected by the GIL, but not an issue for I/O-bound tasks. | Not affected by the GIL; allows true parallel execution for CPU-bound tasks. |
| Memory usage | Lower, since threads share the same memory space. | Higher, as each process has its own memory space. |
| Ease of use | Slightly easier due to shared memory (care needed with shared data structures). | May require more consideration for data sharing (inter-process communication or shared memory). |
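Since `concurrent.futures` also provides ProcessPoolExecutor with the same `map` interface, switching between threads and processes is a one-line change; a minimal sketch, again with a placeholder folder:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path

import pandas as pd

# swap executors without touching the rest of the code
Executor = ThreadPoolExecutor  # or ProcessPoolExecutor for CPU-bound work

if __name__ == '__main__':  # guard needed when ProcessPoolExecutor spawns workers
    files = list(Path('data').glob('*.csv'))  # placeholder folder
    with Executor() as pool:
        df = pd.concat(pool.map(pd.read_csv, files))
```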
29 Mar 2024