python: read files in parallel and merge them

Given an iterable of CSV files, read each one with pandas in a worker pool and concatenate the results.

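from multiprocessing import Pool
from pathlib import Path

import pandas as pd

folderPath = Path('data')  # hypothetical folder holding the CSV files

if __name__ == '__main__':  # guard required where workers are spawned (Windows, macOS)
    files = folderPath.glob('*.csv')
    with Pool() as pool:
        df = pd.concat(pool.map(pd.read_csv, files))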
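
When merging many files it often helps to record which file each row came from and to renumber the rows, since pd.concat otherwise keeps each file's own 0..n index. A minimal sketch of that idea; read_with_source and merge_frames are hypothetical helper names, not part of pandas. They are defined at module level so that they can be pickled and handed to a process pool.

```python
from pathlib import Path

import pandas as pd


def read_with_source(path):
    """Read one CSV and add a 'source' column with the file name (hypothetical helper)."""
    path = Path(path)
    return pd.read_csv(path).assign(source=path.name)


def merge_frames(frames):
    """Concatenate per-file frames and renumber the rows 0..n-1."""
    return pd.concat(frames, ignore_index=True)
```

With the pool from the snippet above, this would be used as df = merge_frames(pool.map(read_with_source, files)).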

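To see progress, wrap the results in tqdm rather than the input files. pool.map consumes its whole input before any work is done, so a bar around the input fills immediately. pool.imap yields results in input order as they become ready, which lets the bar advance with the actual work.

from tqdm import tqdm

files = list(folderPath.glob('*.csv'))
with Pool() as pool:
    df = pd.concat(tqdm(pool.imap(pd.read_csv, files), total=len(files)))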

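You can use ThreadPoolExecutor to achieve the same thing. Its map method returns a lazy iterator over the results, so the progress bar goes around the call to map, not around the input.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    df = pd.concat(tqdm(pool.map(pd.read_csv, files), total=len(files)))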
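
Executor.map yields results in input order, so a progress bar around it can stall behind one slow file. If that matters, concurrent.futures.as_completed reports each task as soon as it finishes. A minimal sketch under that assumption; map_with_progress and on_done are hypothetical names, and on_done could be the update method of a tqdm bar.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_with_progress(func, items, on_done=None, max_workers=None):
    """Apply func to every item in a thread pool, calling on_done(1) as each finishes.

    Results come back in input order even though tasks complete in any order.
    """
    items = list(items)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember each future's position so the input order can be restored.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()  # re-raises worker errors
            if on_done is not None:
                on_done(1)
    return results
```

For example, with tqdm(total=len(files)) as bar: df = pd.concat(map_with_progress(pd.read_csv, files, bar.update)).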

Which one to pick depends on the workload:

  • ThreadPoolExecutor: uses threads for parallel execution. Threads share the same memory and are lightweight compared to processes. Suited to I/O-bound work (file I/O, network operations, waiting on events).
  • multiprocessing.Pool: uses processes for parallel execution. Each process has its own memory, and inter-process communication costs more than communication between threads. Suited to CPU-bound work that benefits from multiple CPU cores (heavy computation).

| | ThreadPoolExecutor | multiprocessing.Pool |
|---|---|---|
| Execution | Utilizes threads for parallel execution. | Utilizes processes for parallel execution. |
| Best for | Ideal for I/O-bound tasks. | Best suited for CPU-bound tasks. |
| GIL | Affected by the GIL, but not an issue for I/O-bound tasks. | Not affected by the GIL; allows true parallel execution for CPU-bound tasks. |
| Memory usage | Lower, since threads share the same memory space. | Higher, as each process has its own memory space. |
| Ease of use | Slightly easier due to shared memory (care needed with shared data structures). | May require more consideration for data sharing (inter-process communication or shared memory spaces). |
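
The comparison above can be turned into a single switch. The sketch below swaps in ProcessPoolExecutor from concurrent.futures in place of multiprocessing.Pool, because it shares the map interface of ThreadPoolExecutor and only the class has to change. load_csvs and use_processes are hypothetical names chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import pandas as pd


def load_csvs(paths, use_processes=False, max_workers=None):
    """Read CSV files in parallel and concatenate them into one DataFrame."""
    paths = list(paths)
    # Both executors expose the same map() method, so only the class changes.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as pool:
        frames = list(pool.map(pd.read_csv, paths))  # results keep input order
    return pd.concat(frames, ignore_index=True)
```

Threads are the default here since reading files is mostly I/O. When use_processes=True, call the function from under an if __name__ == '__main__': guard so that spawned workers do not re-run the script.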

29. Mar 2024