{"id":11545,"date":"2026-02-06T20:11:42","date_gmt":"2026-02-06T20:11:42","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=11545"},"modified":"2026-02-06T20:11:42","modified_gmt":"2026-02-06T20:11:42","slug":"amazon-machine-studying-challenge-gross-sales-information-in-python","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=11545","title":{"rendered":"Amazon Machine Learning Project: Sales Data in Python"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"article-start\">\n<p>Machine learning projects work best when they connect theory to real business outcomes. In e-commerce, that means better revenue, smoother operations, and happier customers, all driven by data. By working with realistic datasets, practitioners learn how models turn patterns into decisions that actually matter.<\/p>\n<p>This article walks through a full machine learning workflow using an Amazon sales dataset, from problem framing to a submission-ready prediction file. It gives learners a clear view of how models turn insights into business value.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-understanding-the-problem-statement\">Understanding the problem statement<\/h2>\n<p>Before moving on to the code, it&#8217;s essential to read the problem statement and understand it. The dataset consists of Amazon-style e-commerce transactions that reflect realistic online shopping patterns.<\/p>\n<p>The primary objective of this project is to predict order outcomes and analyze revenue-driving factors using structured transactional data. 
We will build a supervised machine learning model that learns from past transaction data to forecast outcomes on new test data.<\/p>\n<p><strong>Key Business Questions Addressed<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Which factors influence the final order amount?<\/li>\n<li>How do discounts, taxes, and shipping costs affect revenue?<\/li>\n<li>Can we predict order status or total transaction value accurately?<\/li>\n<li>What insights can businesses extract to improve sales performance?<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"h-about-the-dataset\">About the dataset<\/h2>\n<p>The dataset consists of 100,000 e-commerce transactions that follow Amazon\u2019s transaction format and include 20 structured data fields. The synthetic data mirrors realistic customer behavior and business operations.<\/p>\n<p>The dataset covers price variation across product categories, customer locations, payment options, and order tracking statuses. 
These properties make it suitable for machine learning, analytics, and dashboard development.<\/p>\n<div>\n<table style=\"width:100%; border-collapse:collapse; font-family:Arial, sans-serif; font-size:14px;\">\n<thead>\n<tr style=\"background:#eeeeee;\">\n<th style=\"border:1px solid #ccc; padding:10px; text-align:left;\">Section<\/th>\n<th style=\"border:1px solid #ccc; padding:10px; text-align:left;\">Field Name<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"><strong>Order Details<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">OrderID<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">OrderDate<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">OrderStatus<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">SellerID<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"><strong>Customer Information<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">CustomerID<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">CustomerName<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">City<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">State<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Country<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"><strong>Product Information<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">ProductID<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; 
padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">ProductName<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Category<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Brand<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Quantity<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"><strong>Pricing &amp; Revenue Metrics<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">UnitPrice<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Discount<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">Tax<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">ShippingCost<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"\/>\n<td style=\"border:1px solid #ddd; padding:8px;\">TotalAmount<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\"><strong>Payment Details<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">PaymentMethod<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<h2 class=\"wp-block-heading\" id=\"h-load-essential-python-libraries\">Load essential Python Libraries<\/h2>\n<p>Model development starts with importing the essential Python libraries for working with data. 
The combination of <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2022\/08\/the-ultimate-guide-to-pandas-for-data-science\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas<\/a> and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2020\/04\/the-ultimate-numpy-tutorial-for-data-science-beginners\/\" target=\"_blank\" rel=\"noreferrer noopener\">NumPy<\/a> lets us handle data and perform mathematical calculations. Visualization is handled by <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2021\/10\/introduction-to-matplotlib-using-python-for-beginners\/\" target=\"_blank\" rel=\"noreferrer noopener\">Matplotlib<\/a> and Seaborn. Scikit-learn provides functions for preprocessing and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2022\/01\/machine-learning-algorithms\/\" target=\"_blank\" rel=\"noreferrer noopener\">ML algorithms<\/a>. 
Here is the set of imports used in this walkthrough:<\/p>\n<pre class=\"wp-block-code\"><code>import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score<\/code><\/pre>\n<p>These libraries let us perform four main tasks: loading CSV data, cleaning and transforming it, using charts for trend analysis, and building a regression model.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-load-the-datasets\">Load the datasets<\/h2>\n<p>Once the environment is set up, we load the data into a <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2022\/08\/the-ultimate-guide-to-pandas-for-data-science\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas<\/a> DataFrame. This step turns the raw CSV file into a format we can analyze and manipulate programmatically.<\/p>\n<pre class=\"wp-block-code\"><code>df = pd.read_csv(\"Amazon.csv\")\n\nprint(\"Shape:\", df.shape)<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Shape: (100000, 20)<\/pre>\n<p>After loading, we check the data structure to confirm that it was imported correctly. 
We check the dataset dimensions and look for any initial data quality problems.<\/p>\n<pre class=\"wp-block-code\"><code>print(\"\\nMissing values:\\n\", df.isna().sum())\n\ndf.head()<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Missing values:<p>OrderID         0<br\/>OrderDate       0<br\/>CustomerID      0<br\/>CustomerName    0<br\/>ProductID       0<br\/>ProductName     0<br\/>Category        0<br\/>Brand           0<br\/>Quantity        0<br\/>UnitPrice       0<br\/>Discount        0<br\/>Tax             0<br\/>ShippingCost    0<br\/>TotalAmount     0<br\/>PaymentMethod   0<br\/>OrderStatus     0<br\/>City            0<br\/>State           0<br\/>Country         0<br\/>SellerID        0<\/p><p>dtype: int64<\/p><\/pre>\n<div>\n<table style=\"width:100%; border-collapse:collapse; background:#fff; font-family:Arial, sans-serif; font-size:14px;\">\n<thead>\n<tr style=\"background:#eeeeee;\">\n<th style=\"border:1px solid #ccc; padding:10px;\">OrderID<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">OrderDate<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">CustomerID<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">CustomerName<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">ProductID<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">ProductName<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">Category<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">Brand<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">Quantity<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">UnitPrice<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">Discount<\/th>\n<th 
style=\"border:1px solid #ccc; padding:10px;\">Tax<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">ShippingCost<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">TotalAmount<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">PaymentMethod<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">OrderStatus<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">City<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">State<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">Country<\/th>\n<th style=\"border:1px solid #ccc; padding:10px;\">SellerID<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\">ORD0000001<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">2023-01-31<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">CUST001504<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Vihaan Sharma<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">P00014<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Drone Mini<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Books<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">BrightLux<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">3<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">106.59<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">0.00<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">0.00<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">0.09<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">319.86<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Debit Card<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Delivered<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Washington<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">DC<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">India<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">SELL01967<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; 
padding:8px;\">ORD0000002<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">2023-12-30<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">CUST000178<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Pooja Kumar<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">P00040<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Microphone<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Home &amp; Kitchen<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">UrbanStyle<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">1<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">251.37<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">0.05<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">19.10<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">1.74<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">259.64<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Amazon Pay<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Delivered<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Fort Worth<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">TX<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">United States<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">SELL01298<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\">ORD0000003<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">2022-05-10<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">CUST047516<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Sneha Singh<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">P00044<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Power Bank 20000mAh<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Clothes<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">UrbanStyle<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">3<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">35.03<\/td>\n<td 
style=\"border:1px solid #ddd; padding:8px;\">0.10<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">7.57<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">5.91<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">108.06<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Debit Card<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Delivered<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Austin<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">TX<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">United States<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">SELL00908<\/td>\n<\/tr>\n<tr>\n<td style=\"border:1px solid #ddd; padding:8px;\">ORD0000004<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">2023-07-18<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">CUST030059<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Vihaan Reddy<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">P00041<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Webcam Full HD<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Home &amp; Kitchen<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Zenith<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">5<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">33.58<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">0.15<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">11.42<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">5.53<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">159.66<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Cash on Delivery<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Delivered<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">Charlotte<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">NC<\/td>\n<td style=\"border:1px solid #ddd; padding:8px;\">India<\/td>\n<td style=\"border:1px solid #ddd; 
padding:8px;\">SELL01164<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<h2 class=\"wp-block-heading\" id=\"h-data-preprocessing\">Data Preprocessing<\/h2>\n<p><strong>1. Decomposing Date Features<\/strong><\/p>\n<p>Models cannot do math on a string like \u201c2023-01-31\u201d. Splitting it into parts such as \u201cMonth: 1\u201d and \u201cYear: 2023\u201d creates numerical attributes that can capture seasonal patterns such as holiday sales.<\/p>\n<pre class=\"wp-block-code\"><code>df[\"OrderDate\"] = pd.to_datetime(df[\"OrderDate\"], errors=\"coerce\")\ndf[\"OrderYear\"] = df[\"OrderDate\"].dt.year\ndf[\"OrderMonth\"] = df[\"OrderDate\"].dt.month\ndf[\"OrderDay\"] = df[\"OrderDate\"].dt.day<\/code><\/pre>\n<p>We have extracted three new features: OrderYear, OrderMonth, and OrderDay. The model can now learn patterns such as \u201cDecember brings higher sales\u201d. Note that OrderDay is the day of the month, so weekday effects such as \u201cweekends produce more sales\u201d would require a separate day-of-week feature.<\/p>\n<p><strong>2.<\/strong> <strong>Dropping Irrelevant Features<\/strong><\/p>\n<p>The model needs only specific columns. Unique identifiers (OrderID, CustomerID) carry no predictive information and encourage the model to memorize the training data, which leads to overfitting. We also drop OrderDate since we just extracted its useful parts.<\/p>\n<pre class=\"wp-block-code\"><code>cols_to_drop = [\n    \"OrderID\",\n    \"CustomerID\",\n    \"CustomerName\",\n    \"ProductID\",\n    \"ProductName\",\n    \"SellerID\",\n    \"OrderDate\",  # already decomposed\n]\n\ndf = df.drop(columns=cols_to_drop)<\/code><\/pre>\n<p>The DataFrame now contains only the columns that carry predictive value. 
The model can now pick up general patterns from features such as product category and tax, while the customer-specific identifiers that could introduce \u201cleakage\u201d and noise have been removed.<\/p>\n<p><strong>3. Handling Missing Values<\/strong><\/p>\n<p>The initial check showed no missing values, but the pipeline should still handle real-world conditions. The model would fail if future data contained missing values. We add a safety net by filling gaps with the median (for numbers) or \u201cUnknown\u201d (for text).<\/p>\n<pre class=\"wp-block-code\"><code>print(\"\\nMissing values after transformations:\\n\", df.isna().sum())\n\n# If any missing values in numeric columns, fill with median\n# (\"number\" also matches the int32 date parts created above)\nnumeric_cols = df.select_dtypes(include=\"number\").columns.tolist()\n\nfor col in numeric_cols:\n    if df[col].isna().sum() &gt; 0:\n        df[col] = df[col].fillna(df[col].median())<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Category        0<br\/>Brand           0<br\/>Quantity        0<br\/>UnitPrice       0<br\/>Discount        0<br\/>Tax             0<br\/>ShippingCost    0<br\/>TotalAmount     0<br\/>PaymentMethod   0<br\/>OrderStatus     0<br\/>City            0<br\/>State           0<br\/>Country         0<br\/>OrderYear       0<br\/>OrderMonth      0<br\/>OrderDay        0<br\/>dtype: int64<\/pre>\n<pre class=\"wp-block-code\"><code># For categorical columns, fill with \"Unknown\"\ncategorical_cols = df.select_dtypes(include=[\"object\"]).columns.tolist()\n\nfor col in categorical_cols:\n    df[col] = df[col].fillna(\"Unknown\")\n\nprint(\"\\nFinal dtypes after cleaning:\\n\", df.dtypes)<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Category        object<br\/>Brand           object<br\/>Quantity        int64<br\/>UnitPrice       float64<br\/>Discount        float64<br\/>Tax             float64<br\/>ShippingCost    float64<br\/>TotalAmount     float64<br\/>PaymentMethod   object<br\/>OrderStatus     
object<br\/>City            object<br\/>State           object<br\/>Country         object<br\/>OrderYear       int32<br\/>OrderMonth      int32<br\/>OrderDay        int32<br\/>dtype: object<\/pre>\n<p>The pipeline is now more robust. The final <em>dtypes<\/em> check confirms that our data is prepped: all categorical variables are objects (ready for encoding) and all numerical variables are <em>int32<\/em>, <em>int64<\/em>, or <em>float64<\/em> (ready for scaling).<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-exploratory-data-analysis-eda\">Exploratory data analysis (EDA)<\/h2>\n<p>Data analysis begins with a first examination of the data, which we treat like an interview to learn about its characteristics. Our investigation has three main parts, which we use to identify patterns, spot outliers, and examine distributions.<\/p>\n<p><strong>Statistical Summary:<\/strong> We need to understand the mathematical properties of our numerical columns. Are the prices reasonable? Are there negative values where none should exist?<\/p>\n<pre class=\"wp-block-code\"><code># Basic data understanding \/ EDA (lightweight)\nprint(\"\\nDescriptive stats (numeric):\\n\")\n\ndf.describe()<\/code><\/pre>\n<p><strong>The descriptive statistics table provides important context:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Quantity:<\/strong> The values range from 1 to 5 with a mean of about 3. 
Small quantities like these are typical of retail consumers rather than bulk B2B buyers.<\/li>\n<li><strong>UnitPrice:<\/strong> Prices range from 5.00 to 599.99, which indicates multiple product tiers.<\/li>\n<\/ul>\n<p>The target variable TotalAmount shows high variance: its standard deviation is about 724, so the model must handle everything from small purchases up to the maximum of 3534.98.<\/p>\n<p><strong>Categorical Analysis<\/strong><\/p>\n<p>We need to know the cardinality (number of unique values) of our categorical features. High cardinality, such as thousands of unique cities, bloats the encoded feature space and can lead to overfitting.<\/p>\n<pre class=\"wp-block-code\"><code>print(\"\\nUnique values in some categorical columns:\")\n\nfor col in [\"Category\", \"Brand\", \"PaymentMethod\", \"OrderStatus\", \"Country\"]:\n    print(f\"{col}: {df[col].nunique()} unique\")<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Unique values in some categorical columns:<p>Category: 6 unique<br\/>Brand: 10 unique<br\/>PaymentMethod: 6 unique<br\/>OrderStatus: 5 unique<br\/>Country: 5 unique<\/p><\/pre>\n<p><strong>Visualizing the Target Distribution<\/strong><\/p>\n<p>The histogram shows the frequency of different transaction amounts, and a smooth curve (KDE) lets us see the density. The curve is slightly right-skewed, which tree-based models like Random Forest handle well.<\/p>\n<pre class=\"wp-block-code\"><code>sns.histplot(df[\"TotalAmount\"], kde=True)\nplt.title(\"TotalAmount distribution\")\nplt.show()<\/code><\/pre>\n<p>The TotalAmount plot lets us judge whether the data is skewed. 
A log transformation is worth considering when the data shows extreme skewness, with only a few high-priced products and numerous low-cost items.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"580\" height=\"453\" src=\"https:\/\/cdn.analyticsvidhya.com\/wp-content\/uploads\/2026\/01\/image3-7.webp\" alt=\"Bar Graph \" class=\"wp-image-250603\" srcset=\"https:\/\/cdn.analyticsvidhya.com\/wp-content\/uploads\/2026\/01\/image3-7.webp 580w, https:\/\/cdn.analyticsvidhya.com\/wp-content\/uploads\/2026\/01\/image3-7-300x234.webp 300w, https:\/\/cdn.analyticsvidhya.com\/wp-content\/uploads\/2026\/01\/image3-7-150x117.webp 150w\" sizes=\"(max-width: 580px) 100vw, 580px\"\/><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\" id=\"h-feature-engineering\">Feature Engineering<\/h2>\n<p>Feature engineering creates new variables by transforming existing ones to boost model performance. In supervised learning, we must explicitly tell the model what to predict (y) and what data to use to make that prediction (X).<\/p>\n<pre class=\"wp-block-code\"><code>target_column = \"TotalAmount\"\n\nX = df.drop(columns=[target_column])\ny = df[target_column]\n\n# \"number\" also picks up the int32 date parts (OrderYear, OrderMonth, OrderDay)\nnumeric_features = X.select_dtypes(include=\"number\").columns.tolist()\ncategorical_features = X.select_dtypes(include=[\"object\"]).columns.tolist()\n\nprint(\"\\nNumeric features:\", numeric_features)\nprint(\"Categorical features:\", categorical_features)<\/code><\/pre>\n<h2 class=\"wp-block-heading\" id=\"h-splitting-the-train-and-test-data-nbsp\">Splitting the train and test data<\/h2>\n<p>Evaluating a model on the same data it was trained on is like giving students the exam answers before the test, so we hold out separate data for evaluation. 
The data is split into two parts: a training set the model learns from and a test set used to verify the results.<\/p>\n<pre class=\"wp-block-code\"><code>X_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=42\n)\n\nprint(\"\\nTrain shape:\", X_train.shape, \"Test shape:\", X_test.shape)<\/code><\/pre>\n<p>Here we use an 80-20 split: a random 80% of the data is used for training and the remaining 20% is held out as the test set.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-build-machine-learning-model\">Build Machine Learning Model<\/h2>\n<p>Creating the ML pipeline involves the following steps:<\/p>\n<p><strong>1.<\/strong> <strong>Creating Preprocessing Pipelines<\/strong><\/p>\n<p>The raw numeric columns sit on very different scales: Quantity ranges from 1 to 5, while UnitPrice ranges from about 5 to 600. Scaling helps many models converge faster, although tree-based models such as Random Forest are largely insensitive to it. One-hot encoding transforms categorical text into numerical format. 
The <code>ColumnTransformer<\/code> lets us apply a different transformation to each column type in our dataset.<\/p>\n<pre class=\"wp-block-code\"><code>numeric_transformer = Pipeline(\n    steps=[\n        (\"scaler\", StandardScaler())\n    ]\n)\n\ncategorical_transformer = Pipeline(\n    steps=[\n        (\"onehot\", OneHotEncoder(handle_unknown=\"ignore\"))\n    ]\n)\n\npreprocessor = ColumnTransformer(\n    transformers=[\n        (\"num\", numeric_transformer, numeric_features),\n        (\"cat\", categorical_transformer, categorical_features),\n    ]\n)<\/code><\/pre>\n<p><strong>2.<\/strong> <strong>Defining the Random Forest Model<\/strong><\/p>\n<p>We have chosen the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2021\/06\/understanding-random-forest\/\" target=\"_blank\" rel=\"noreferrer noopener\">Random Forest<\/a> Regressor for this project. This ensemble method builds multiple decision trees and averages their predictions to produce the final forecast. It is robust against overfitting and handles non-linear relationships between variables well.<\/p>\n<pre class=\"wp-block-code\"><code>model = RandomForestRegressor(\n    n_estimators=200,\n    max_depth=None,\n    random_state=42,\n    n_jobs=-1\n)\n\n# Full pipeline\nregressor = Pipeline(\n    steps=[\n        (\"preprocessor\", preprocessor),\n        (\"model\", model),\n    ]\n)<\/code><\/pre>\n<p>We created the model with <code>n_estimators=200<\/code> to build 200 decision trees and <code>n_jobs=-1<\/code> to use all CPU cores for faster training. As a best practice, we combine the preprocessor and the model in a single Pipeline object so the whole workflow can be handled as one unit.<\/p>\n<p><strong>3. 
Training the Model<\/strong><\/p>\n<p>This stage is the main learning step. The pipeline runs the training data through the transformation steps and then fits the Random Forest model on the transformed data.<\/p>\n<pre class=\"wp-block-code\"><code>regressor.fit(X_train, y_train)\n\nprint(\"\\nModel training complete.\")<\/code><\/pre>\n<p>The model has now learned how the input variables (Category, UnitPrice, Tax, etc.) relate to the output variable (TotalAmount).<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-make-predictions-on-the-test-dataset\">Make predictions on the test dataset<\/h2>\n<p>Now we test the model on the test data (i.e., 20,000 \u201cunseen\u201d records). We evaluate performance with statistical metrics that compare the predicted outcomes (<code>y_pred<\/code>) with the actual outcomes (<code>y_test<\/code>).<\/p>\n<pre class=\"wp-block-code\"><code>y_pred = regressor.predict(X_test)\n\nmae = mean_absolute_error(y_test, y_pred)\nmse = mean_squared_error(y_test, y_pred)\nrmse = np.sqrt(mse)\nr2 = r2_score(y_test, y_pred)\n\nprint(\"\\nTest metrics:\")\nprint(\"MAE :\", mae)\nprint(\"MSE :\", mse)\nprint(\"RMSE:\", rmse)\nprint(\"R2  :\", r2)<\/code><\/pre>\n<pre class=\"wp-block-preformatted\">Test metrics:<p>MAE : 3.886121525000014<br\/>MSE : 41.06268576375389<br\/>RMSE: 6.408017303640331<br\/>R2  : 0.99992116450905<\/p><\/pre>\n<p><strong>This indicates:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>The Mean Absolute Error (MAE) is approximately 3.89. On average, our predictions are off by about $3.89.<\/li>\n<li>The R2 score is approximately 0.9999, which is near perfect. The independent variables (price, tax, shipping) account almost entirely for TotalAmount. 
This is expected for synthetic financial data, where the total follows a fixed formula. The sample rows shown earlier are consistent with <em>Total = UnitPrice * Quantity * (1 \u2013 Discount) + Tax + ShippingCost<\/em>, with Discount stored as a fraction.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"h-prepare-submission-file\">Prepare submission file<\/h2>\n<p>Participants must submit their predictions in the predetermined output format, which must not be altered.<\/p>\n<pre class=\"wp-block-code\"><code># OrderID was dropped during preprocessing, so re-read it from the CSV;\n# the default row index is unchanged, so X_test.index still lines up.\norder_ids = pd.read_csv(\"Amazon.csv\", usecols=[\"OrderID\"])[\"OrderID\"]\n\nsubmission = pd.DataFrame({\n    \"OrderID\": order_ids.loc[X_test.index],\n    \"PredictedTotalAmount\": y_pred\n})\n\nsubmission.to_csv(\"submission.csv\", index=False)<\/code><\/pre>\n<p>This file can be submitted directly to the evaluation system or shared with stakeholders.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n<p>This machine learning project demonstrates the full process of transforming raw e-commerce transaction data into useful predictive results. The structured workflow lets you handle real datasets with confidence and a clear understanding of each step. The success of the project depends on its core steps: preprocessing, EDA, feature engineering, and modeling.<\/p>\n<p>The project helps you build your machine learning skills while preparing you for real work situations. The pipeline can be optimized further or extended, for example into a recommendation system using more advanced models or deep learning techniques.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-frequently-asked-questions\">Frequently Asked Questions<\/h2>\n<div class=\"schema-faq wp-block-yoast-faq-block\">\n<div class=\"schema-faq-section\" id=\"faq-question-1769767822194\"><strong class=\"schema-faq-question\">Q1. 
What is the main goal of this Amazon sales machine learning project?<\/strong> <\/p>\n<p class=\"schema-faq-answer\">A. It aims to predict the total order amount using transactional and pricing data.<\/p>\n<\/p><\/div>\n<div class=\"schema-faq-section\" id=\"faq-question-1769767846865\"><strong class=\"schema-faq-question\">Q2. Why was a Random Forest model chosen for this project?<\/strong> <\/p>\n<p class=\"schema-faq-answer\">A. It captures complex patterns and reduces overfitting by combining many decision trees.<\/p>\n<\/p><\/div>\n<div class=\"schema-faq-section\" id=\"faq-question-1769767857188\"><strong class=\"schema-faq-question\">Q3. What does the final submission file contain?<\/strong> <\/p>\n<p class=\"schema-faq-answer\">A. It includes OrderID and the model\u2019s predicted total amount for each order.<\/p>\n<\/p><\/div><\/div>\n<div class=\"border-top py-3 author-info my-4\">\n<div class=\"author-card d-flex align-items-center\">\n<div class=\"flex-shrink-0 overflow-hidden\">\n                                    <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/author\/vipin355333\/\" class=\"text-decoration-none active-avatar\"><br \/>\n                                                                       <img decoding=\"async\" src=\"https:\/\/av-eks-lekhak.s3.amazonaws.com\/media\/lekhak-profile-images\/converted_image_q6dapDN.webp\" width=\"48\" height=\"48\" alt=\"Vipin Vashisth\" loading=\"lazy\" class=\"rounded-circle\"\/><\/p>\n<p>                                <\/a>\n                                <\/div><\/div>\n<p>Hello! I am Vipin, a passionate data science and machine learning enthusiast with a strong foundation in data analysis, machine learning algorithms, and programming. I have hands-on experience in building models, managing messy data, and solving real-world problems. 
My goal is to apply data-driven insights to create practical solutions that drive results. I am eager to contribute my skills in a collaborative environment while continuing to learn and grow in the fields of Data Science, Machine Learning, and NLP.<\/p>\n<\/p><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Machine learning projects work best when they connect theory to real business outcomes. In e-commerce, that means better revenue, smoother operations, and happier customers, all driven by data. By working with realistic datasets, practitioners learn how models turn patterns into decisions that actually matter. This article walks through a full machine learning workflow using 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":11547,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[387,157,136,113,1640,1258,1987],"class_list":["post-11545","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-amazon","tag-data","tag-learning","tag-machine","tag-project","tag-python","tag-sales"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11545","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=11545"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11545\/revisions"}],"predecessor-version":[{"id":11546,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11545\/revisions\/11546"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/11547"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=11545"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=11545"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=11545"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}