<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Pandas on Pratap Vardhan</title>
    <link>https://pratapvardhan.com/tags/pandas/</link>
    <description>Recent content in Pandas on Pratap Vardhan</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sat, 12 Oct 2024 10:22:33 +0530</lastBuildDate>
    <atom:link href="https://pratapvardhan.com/tags/pandas/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>python: read files in parallel and merge them</title>
      <link>https://pratapvardhan.com/notes/python/read-files-parallel-merge/</link>
      <pubDate>Fri, 29 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://pratapvardhan.com/notes/python/read-files-parallel-merge/</guid>
      <description>&lt;p&gt;Given an iterable of CSV files, read them in parallel with pandas and concatenate the results.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; pandas &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; pd&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; multiprocessing &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; Pool&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;files &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; folderPath&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;glob(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;*.csv&amp;#39;&lt;/span&gt;)&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;with&lt;/span&gt; Pool() &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; pool:&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    df &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;concat(pool&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;map(pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;read_csv, files))&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To see progress, wrap the input files in &lt;code&gt;tqdm&lt;/code&gt;. The bar advances as files are handed to the pool, so it tracks dispatch rather than completed reads.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; 
style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; tqdm &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; tqdm&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;files &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; list(folderPath&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;glob(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;*.csv&amp;#39;&lt;/span&gt;))&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;with&lt;/span&gt; Pool() &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; pool:&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    df &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;concat(pool&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;map(pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;read_csv, tqdm(files, total&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;len(files))))&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can use &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; to achieve the same thing.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; concurrent.futures &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; ThreadPoolExecutor&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span 
style=&#34;color:#66d9ef&#34;&gt;with&lt;/span&gt; ThreadPoolExecutor() &lt;span style=&#34;color:#66d9ef&#34;&gt;as&lt;/span&gt; pool:&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    df &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;concat(pool&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;map(pd&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;read_csv, tqdm(files, total&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;len(files))))&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;em&gt;ThreadPoolExecutor&lt;/em&gt;: Uses threads for concurrent execution. Threads share the same memory and are lightweight compared to processes. Suited to I/O-bound work (file I/O, network operations, waiting on events).&lt;/li&gt;&#xA;&lt;li&gt;&lt;em&gt;multiprocessing.Pool&lt;/em&gt;: Uses processes for parallel execution. Each process has its own memory, so inter-process communication is more costly than communication between threads. Useful for CPU-bound tasks that benefit from multiple CPU cores. 
(For example, heavy computation.)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;table&gt;&#xA;  &lt;thead&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;th&gt;Aspect&lt;/th&gt;&#xA;          &lt;th&gt;ThreadPoolExecutor&lt;/th&gt;&#xA;          &lt;th&gt;multiprocessing.Pool&lt;/th&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/thead&gt;&#xA;  &lt;tbody&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;&#xA;          &lt;td&gt;Uses threads for concurrent execution.&lt;/td&gt;&#xA;          &lt;td&gt;Uses processes for parallel execution.&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;&#xA;          &lt;td&gt;Ideal for I/O-bound tasks.&lt;/td&gt;&#xA;          &lt;td&gt;Best suited for CPU-bound tasks.&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;&lt;strong&gt;GIL&lt;/strong&gt;&lt;/td&gt;&#xA;          &lt;td&gt;Affected by the GIL, which is rarely a bottleneck for I/O-bound tasks because it is released during blocking I/O.&lt;/td&gt;&#xA;          &lt;td&gt;Not limited by the GIL; each process runs its own interpreter, allowing true parallel execution for CPU-bound tasks.&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;&#xA;          &lt;td&gt;Lower, since threads share the same memory space.&lt;/td&gt;&#xA;          &lt;td&gt;Higher, as each process has its own memory space.&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;&lt;/td&gt;&#xA;          &lt;td&gt;Slightly easier due to shared memory (care needed with shared data structures).&lt;/td&gt;&#xA;          &lt;td&gt;May require more consideration for data sharing (inter-process communication or shared memory spaces).&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/tbody&gt;&#xA;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>
