{"id":7755,"date":"2025-10-17T00:23:16","date_gmt":"2025-10-17T00:23:16","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=7755"},"modified":"2025-10-17T00:23:16","modified_gmt":"2025-10-17T00:23:16","slug":"how-i-constructed-a-knowledge-cleansing-pipeline-utilizing-one-messy-doordash-dataset","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=7755","title":{"rendered":"How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"post-\">\n<p>    <center><img decoding=\"async\" alt=\"How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-1-scaled.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-1-scaled.png\" alt=\"How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset\" width=\"100%\"\/><br \/><span>Image by Editor<\/span><\/center><br \/>\n\u00a0<\/p>\n<h2><span>#\u00a0<\/span>Introduction<\/h2>\n<p>\u00a0<br \/>According to <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www2.cs.uh.edu\/~ceick\/UDM\/CFDS16.pdf\">CrowdFlower\u2019s survey<\/a><\/strong>, data scientists spend 60% of their time organizing and cleaning data.<\/p>\n<p>In this article, we\u2019ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. 
It contains nearly 200,000 food delivery records, each of which includes over a dozen features such as delivery time, total items, and store category (e.g., Mexican, Thai, or American cuisine).<\/p>\n<p>\u00a0<\/p>\n<h2><span>#\u00a0<\/span>Predicting Food Delivery Times with DoorDash Data<\/h2>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Predicting Food Delivery Times with DoorDash Data\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-2.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-2.png\" alt=\"Predicting Food Delivery Times with DoorDash Data\" width=\"100%\"\/><br \/>\u00a0<br \/>DoorDash aims to accurately estimate the time it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/platform.stratascratch.com\/data-projects\/delivery-duration-prediction?utm_source=blog&amp;utm_medium=click&amp;utm_campaign=kdn+building+data+cleaning+pipeline\">this data project<\/a><\/strong>, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.<\/p>\n<p>However, we won\u2019t do the whole project; that is, we won\u2019t build a predictive model. 
Instead, we\u2019ll use the dataset provided in the project and create a data cleaning pipeline.<\/p>\n<p>Our workflow consists of two major steps: data exploration and building the data cleaning pipeline.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-3-scaled.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-3-scaled.png\" alt=\"Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>\u00a0<\/p>\n<h2><span>#\u00a0<\/span>Data Exploration<\/h2>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-4-scaled.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-4-scaled.png\" alt=\"Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>Let\u2019s start by loading and viewing the first few rows of the dataset.<\/p>\n<p>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>Load and Preview the Dataset<\/h4>\n<div style=\"width: 98%;overflow: auto;padding-left: 10px;padding-bottom: 10px;padding-top: 10px;background: #F5F5F5\">\n<pre><code>import pandas as pd\n\n# Load the historical delivery data and preview the first five rows\ndf = pd.read_csv(\"historical_data.csv\")\ndf.head()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here is the output.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-5.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-5.png\" alt=\"Data Cleaning Pipeline\" width=\"100%\"\/><br 
\/>\u00a0<\/p>\n<p>This dataset includes datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate the delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we\u2019ll explore more closely in the following step.<\/p>\n<p>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>Explore the Columns With <code style=\"background: #F5F5F5;\">info()<\/code><\/h4>\n<p>Let\u2019s inspect all column names with the <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.info.html\">info()<\/a><\/strong> method. We will use this method throughout the article to see the changes in the non-null counts of each column; it\u2019s a good indicator of missing data and overall data health.<\/p>\n<p>\u00a0<\/p>\n<p>Here is the output of <code style=\"background: #F5F5F5;\">df.info()<\/code>.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-6.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-6.png\" alt=\"Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which could affect our analysis if not handled properly. 
One last thing: the <em>created_at<\/em> and <em>actual_delivery_time<\/em> data types are objects; these should be datetime.<\/p>\n<p>\u00a0<\/p>\n<h2><span>#\u00a0<\/span>Building the Data Cleaning Pipeline<\/h2>\n<p>\u00a0<br \/>In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.<br \/>\u00a0<br \/><img decoding=\"async\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-7-scaled.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-7-scaled.png\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>Fixing the Data Types of the Date and Time Columns<\/h4>\n<p>Before doing any data analysis, we need to fix the columns that store timestamps. Otherwise, the calculation that we mentioned (<em>actual_delivery_time &#8211; created_at<\/em>) will go wrong.<\/p>\n<p>What we\u2019re fixing:<\/p>\n<ul>\n<li><em>created_at<\/em>: when the order was placed\n<\/li>\n<li><em>actual_delivery_time<\/em>: when the food arrived\n<\/li>\n<\/ul>\n<p>These two columns are stored as objects, so to be able to do calculations correctly, we have to convert them to the datetime format. To do that, we can use the datetime functions in <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/pandas.pydata.org\/\">pandas<\/a><\/strong>. 
Here is the code.<\/p>\n<div style=\"width: 98%;overflow: auto;padding-left: 10px;padding-bottom: 10px;padding-top: 10px;background: #F5F5F5\">\n<pre><code>import pandas as pd\ndf = pd.read_csv(\"historical_data.csv\")\n# Convert timestamp strings to datetime objects;\n# errors=\"coerce\" turns unparseable values into NaT instead of raising\ndf[\"created_at\"] = pd.to_datetime(df[\"created_at\"], errors=\"coerce\")\ndf[\"actual_delivery_time\"] = pd.to_datetime(df[\"actual_delivery_time\"], errors=\"coerce\")\ndf.info()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here is the output.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-8.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-8.png\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>As you can see from the screenshot above, <em>created_at<\/em> and <em>actual_delivery_time<\/em> are now datetime columns.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-9.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-9.png\" alt=\"Building Data Cleaning Pipeline\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>Among the key columns, <em>store_primary_category<\/em> has the fewest non-null values (192,668), which means it has the most missing data. 
That\u2019s why we\u2019ll focus on cleaning it first.<\/p>\n<p>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>Data Imputation With <code style=\"background: #F5F5F5;\">mode()<\/code><\/h4>\n<p>One of the messiest columns in the dataset, evident from its high number of missing values, is <em>store_primary_category<\/em>. It tells us what kind of food each store offers, such as Mexican, American, or Thai. However, many rows are missing this information, which is a problem. For instance, it can limit how we can group or analyze the data. So how do we fix it?<\/p>\n<p>We will fill these rows instead of dropping them. To do that, we will use smarter imputation.<\/p>\n<p>We build a mapping from each <em>store_id<\/em> to its most frequent category, and then use that mapping to fill in missing values. Let\u2019s see the dataset before doing that.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Imputation With mode\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-10.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-10.png\" alt=\"Data Imputation With mode\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>Here is the code.<\/p>\n<div style=\"width: 98%;overflow: auto;padding-left: 10px;padding-bottom: 10px;padding-top: 10px;background: #F5F5F5\">\n<pre><code>import numpy as np\n\n# Global most-frequent category as a fallback\nglobal_mode = df[\"store_primary_category\"].mode().iloc[0]\n\n# Build store-level mapping to the most frequent category (fast and robust)\nstore_mode = (\n    df.groupby(\"store_id\")[\"store_primary_category\"]\n      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)\n)\n\n# Fill missing categories using the 
store-level mode, then fall back to the global mode\ndf[\"store_primary_category\"] = (\n    df[\"store_primary_category\"]\n      .fillna(df[\"store_id\"].map(store_mode))\n      .fillna(global_mode)\n)\n\ndf.info()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here is the output.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Imputation With mode\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-11.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-11.png\" alt=\"Data Imputation With mode\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>As you can see from the screenshot above, the <em>store_primary_category<\/em> column now has a higher non-null count. But let\u2019s double-check with this code.<\/p>\n<div style=\"width: 98%;overflow: auto;padding-left: 10px;padding-bottom: 10px;padding-top: 10px;background: #F5F5F5\">\n<pre><code>df[\"store_primary_category\"].isna().sum()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here is the output showing the number of NaN values. 
It\u2019s zero; we got rid of all of them.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Imputation With mode\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-12.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-12.png\" alt=\"Data Imputation With mode\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>And let\u2019s see the dataset after the imputation.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Data Imputation With mode\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-13.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-13.png\" alt=\"Data Imputation With mode\" width=\"100%\"\/><\/p>\n<p>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>Dropping Remaining NaNs<\/h4>\n<p>In the previous step, we fixed <em>store_primary_category<\/em>, but did you notice something? The non-null counts across the columns still don\u2019t match!<\/p>\n<p>This is a clear sign that we\u2019re still dealing with missing values in some part of the dataset. Now, when it comes to data cleaning, we have two options:<\/p>\n<ul>\n<li>Fill these missing values\n<\/li>\n<li>Drop them\n<\/li>\n<\/ul>\n<p>Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you\u2019d need to be more cautious. 
In that case, it&#8217;s advisable to analyze each column, establish standards (decide how missing values will be filled: using the mean, median, most frequent value, or domain-specific defaults), and then fill them.<\/p>\n<p>To remove the NaNs, we will use the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.dropna.html\">dropna()<\/a> method from the pandas library. We are setting <em>inplace=True<\/em> to apply the changes directly to the DataFrame without needing to assign it again. Let\u2019s see the dataset at this point.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Dropping NaNs\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-14.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-14.png\" alt=\"Dropping NaNs\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>Here is the code.<\/p>\n<div style=\"width: 98%;overflow: auto;padding-left: 10px;padding-bottom: 10px;padding-top: 10px;background: #F5F5F5\">\n<pre><code># Drop every row that still contains at least one NaN\ndf.dropna(inplace=True)\ndf.info()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here is the output.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" alt=\"Dropping NaNs\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-15.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-15.png\" alt=\"Dropping NaNs\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>As you can see from the screenshot above, each column now has the same number of non-null values.<\/p>\n<p>Let\u2019s see the dataset after all the changes.<\/p>\n<p>\u00a0<br \/><img decoding=\"async\" 
alt=\"Dropping NaNs\" width=\"100%\" class=\"perfmatters-lazy\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-16.png\"\/><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Rosidi-How_I_Built_a_Data_Cleaning_Pipeline-16.png\" alt=\"Dropping NaNs\" width=\"100%\"\/><br \/>\u00a0<\/p>\n<p>\u00a0<\/p>\n<h4><span>\/\/\u00a0<\/span>What Can You Do Next?<\/h4>\n<p>Now that we have a clean dataset, here are a few things you can do next:<\/p>\n<ul>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.stratascratch.com\/blog\/how-to-perform-exploratory-data-analysis-in-python\/?utm_source=blog&amp;utm_medium=click&amp;utm_campaign=kdn+building+data+cleaning+pipeline\">Perform EDA<\/a> to understand delivery patterns.\n<\/li>\n<li>Engineer new features, such as delivery hour or the busy dashers ratio, to add more meaning to your analysis.\n<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.stratascratch.com\/guides\/sql-data-manipulation-skills\/how-to-calculate-a-correlation-between-two-sets-of-values-sql\/?utm_source=blog&amp;utm_medium=click&amp;utm_campaign=kdn+building+data+cleaning+pipeline\">Analyze correlations<\/a> between variables to improve your model&#8217;s performance.\n<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.stratascratch.com\/blog\/overview-of-machine-learning-algorithms-regression\/?utm_source=blog&amp;utm_medium=click&amp;utm_campaign=kdn+building+data+cleaning+pipeline\">Build different regression models<\/a> and find the best-performing model.\n<\/li>\n<li>Predict the delivery duration with the best-performing model.\n<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h2><span>#\u00a0<\/span>Final Thoughts<\/h2>\n<p>\u00a0<br \/>In this article, we have cleaned the real-life dataset from DoorDash by addressing 
common data quality issues, such as fixing incorrect data types and handling missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.<\/p>\n<p>Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!<br \/>\u00a0<br \/>\u00a0<\/p>\n<p><b><strong><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/twitter.com\/StrataScratch\">Nate Rosidi<\/a><\/strong><\/b> is a data scientist who also works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.<\/p>\n<\/p><\/div>\n<p><br \/>\n<br \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image by Editor \u00a0 #\u00a0Introduction \u00a0According to CrowdFlower\u2019s survey, data scientists spend 60% of their time organizing and cleaning data. In this article, we\u2019ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. 
It contains nearly 200,000 food delivery records, each of which includes over a dozen features such as delivery [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":7757,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1007,5930,157,3195,2934,4801,2594],"class_list":["post-7755","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-built","tag-cleaning","tag-data","tag-dataset","tag-doordash","tag-messy","tag-pipeline"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/7755","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7755"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/7755\/revisions"}],"predecessor-version":[{"id":7756,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/7755\/revisions\/7756"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/7757"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7755"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7755"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7755"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}