{"id":14870,"date":"2026-05-17T22:30:26","date_gmt":"2026-05-17T22:30:26","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=14870"},"modified":"2026-05-17T22:30:26","modified_gmt":"2026-05-17T22:30:26","slug":"pandas-isnt-going-anyplace-why-its-nonetheless-my-go-to-for-knowledge-wrangling","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=14870","title":{"rendered":"Pandas Isn\u2019t Going Anywhere: Why It\u2019s Still My Go-To for Data Wrangling"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">When I started learning data science in 2020, Pandas was one of the most popular tools. Although newer tools focus on addressing Pandas\u2019 weaknesses in handling very large datasets, I still use Pandas for many data cleaning, processing, and analysis tasks. Yes, Pandas gives me a hard time when working with billions of rows, but it&#8217;s definitely more than enough for anything below that.<\/p>\n<p class=\"wp-block-paragraph\">I see Pandas being used not only for EDA or in notebooks but also in production systems.<\/p>\n<p class=\"wp-block-paragraph\">In this article, I\u2019ll go over some data cleaning and processing operations to demonstrate how capable Pandas is.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s start with the dataset, which contains stock keeping units (SKUs) and search API responses for those SKUs.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\n\nsearch_results = pd.read_csv(\"search_results.csv\")\n\nsearch_results.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-176.png\" alt=\"\" class=\"wp-image-659551\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Each search result is a list of dictionaries and 
looks like this:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">search_results.loc[0, \"search_result\"]\n\n\"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, \n{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}}, \n{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}}, \n{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}}, \n{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}}, \n{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}}, \n{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}}, \n{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}}, \n{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}}, \n{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}] \n... and 5 entities remaining\"<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As we see in the output, it\u2019s not a proper list of dictionaries because of the last part (\u201c... and 5 entities remaining\u201d). Also, it\u2019s stored as a single string.<\/p>\n<p class=\"wp-block-paragraph\">To make better use of it, we need to convert it to a proper list of dictionaries. The following line of code removes the last part by splitting the string at \u201c...\u201d and taking the first piece.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">search_results.loc[0, \"search_result\"].split(\"...\")[0].strip()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">However, the output is still a single string. 
We can use Python\u2019s built-in ast module to convert it to a list:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import ast\n\nres = ast.literal_eval(search_results.loc[0, \"search_result\"].split(\"...\")[0].strip())\n\nres\n\n[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},\n {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},\n {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},\n {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},\n {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},\n {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},\n {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We now have the search results as a proper list of dictionaries. This was only for a single row. We need to apply the same operation to all SKUs (i.e. the search result in every row).<\/p>\n<p class=\"wp-block-paragraph\">One option is to go over all the rows in a for loop and perform the same operation. However, this isn&#8217;t the best option. We should prefer vectorized operations when we can. A vectorized operation basically means executing the code on all rows at once.<\/p>\n<p class=\"wp-block-paragraph\">On a single row, I used splitting to get rid of the last part of the string, but it didn&#8217;t work in a vectorized operation, most likely because str.split treats a multi-character pattern such as \u201c...\u201d as a regular expression by default. 
A more robust option is to use a regex.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">search_results.loc[:, 'search_result'] = search_results['search_result'].str.replace(r\"\\.\\.\\..*\", \"\", regex=True).str.strip()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This code matches \u201c...\u201d and everything that comes after it and replaces the match with an empty string. The dots are escaped because an unescaped dot matches any character in a regex. In other words, it removes the \u201c... and 5 entities remaining\u201d part.<\/p>\n<p class=\"wp-block-paragraph\">Every row in the search_result column now holds a properly formatted list of dictionaries.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">search_results.loc[10, \"search_result\"]\n\n\"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},\n {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},\n {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},\n {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},\n {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},\n {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},\n {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},\n {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]\"<\/code><\/pre>\n<p class=\"wp-block-paragraph\">They\u2019re still stored as strings, but I can easily convert them to lists using the ast module, which I&#8217;ll do in the next step.<\/p>\n<p class=\"wp-block-paragraph\">What I\u2019m interested in is the SKUs returned in the search results. I\u2019ll create a new column by extracting the SKUs from the dictionaries. 
I can access them using the \u201cmy_id\u201d key of each dictionary.<\/p>\n<p class=\"wp-block-paragraph\">There are 3 parts to this operation:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Convert the search result string to a list using the literal_eval function<\/li>\n<li class=\"wp-block-list-item\">Extract the SKU from the my_id key of the dictionary<\/li>\n<li class=\"wp-block-list-item\">Do this in a list comprehension to get the SKUs from all the dictionaries in the list<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">We can do all these operations by applying a lambda function to all rows as follows:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Parse each string into a list of dicts, then collect the my_id values\nsearch_results.loc[:, \"result_skus\"] = search_results[\"search_result\"].apply(\n    lambda x: [item['my_id'] for item in ast.literal_eval(x)]\n)\n\nsearch_results.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-177-1024x174.png\" alt=\"\" class=\"wp-image-659556\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Each row in the result_skus column contains a list of 10 SKUs. Let\u2019s say I need to have these 10 SKUs in separate rows. For each row in the sku column, there will be 10 rows created from the list in the result_skus column. 
There&#8217;s a very simple way of doing this in Pandas, which is the explode function.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">data = search_results[[\"sku\", \"result_skus\"]].explode(\"result_skus\", ignore_index=True)\n\ndata.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-184.png\" alt=\"\" class=\"wp-image-659569\"\/><\/figure>\n<p class=\"wp-block-paragraph\">We created a new dataframe with the sku and result_skus columns. The drawing below demonstrates what the explode function does:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-185-1024x225.png\" alt=\"\" class=\"wp-image-659570\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Consider the opposite. We have a dataframe as shown above but want to have all results for a SKU in a single row.<\/p>\n<p class=\"wp-block-paragraph\">We can use the groupby function to group the rows by sku and then apply the list function to the result_skus column:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">new_data = data.groupby(\"sku\", as_index=False)[\"result_skus\"].apply(list)\n\nnew_data.head()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This gets us back to the previous step:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-187-1024x260.png\" alt=\"\" class=\"wp-image-659573\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Using the explode function, we created a dataframe with a separate row for each SKU in the result_skus column. 
What if we need to have them separated into different columns instead of rows?<\/p>\n<p class=\"wp-block-paragraph\">One option is to apply the pd.Series function to the result_skus column and concatenate the resulting columns to the original dataframe.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">new_cols = new_data[\"result_skus\"].apply(pd.Series)\n\nnew_data = pd.concat([new_data, new_cols], axis=1)\n\nnew_data.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/image-189-1024x283.png\" alt=\"\" class=\"wp-image-659603\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Columns 0 to 9 contain the 10 SKUs from the result_skus column. This code, which uses the apply function, isn&#8217;t a vectorized operation.<\/p>\n<p class=\"wp-block-paragraph\">We have another option, which is vectorized and much faster.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Alternative to the apply(pd.Series) version above: run this one instead of it\nnew_cols = pd.DataFrame(new_data[\"result_skus\"].tolist())\n\nnew_data = pd.concat([new_data, new_cols], axis=1)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This code gives us the same dataframe as above, but much faster.<\/p>\n<p class=\"wp-block-paragraph\">I demonstrated a typical data cleaning and processing task a data scientist or analyst may encounter in their job. I\u2019ve been in the field for over 5 years, and Pandas has always been enough to do what I need, except when working with very large datasets (e.g. billions of rows).<\/p>\n<p class=\"wp-block-paragraph\">The tools that are a better fit for such large datasets have similar syntax to Pandas. For example, PySpark is kind of a combination of Pandas and SQL. Polars is very similar to Pandas in terms of syntax. 
Thus, learning and practicing Pandas is still a highly valuable skill for anyone working in the data science and AI space.<\/p>\n<p class=\"wp-block-paragraph\">Thanks for reading.<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>When I started learning data science in 2020, Pandas was one of the most popular tools. Although newer tools focus on addressing Pandas\u2019 weaknesses in handling very large datasets, I still use Pandas for many data cleaning, processing, and analysis tasks. Yes, Pandas gives me a hard time when working with billions of rows, but it&#8217;s [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":14872,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[157,9110,460,3666,9111],"class_list":["post-14870","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-data","tag-goto","tag-isnt","tag-pandas","tag-wrangling"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=14870"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14870\/revisions"}],"predecessor-version":[{"id":14871,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14870\/revisions\/14871"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route
=\/wp\/v2\/media\/14872"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=14870"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=14870"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=14870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}