{"id":8031,"date":"2025-10-25T09:15:19","date_gmt":"2025-10-25T09:15:19","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=8031"},"modified":"2025-10-25T09:15:19","modified_gmt":"2025-10-25T09:15:19","slug":"agentic-ai-from-first-rules-reflection","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=8031","title":{"rendered":"Agentic AI from First Rules: Reflection"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p class=\"wp-block-paragraph\"> says that \u201c<em>any sufficiently superior expertise is indistinguishable from magic<\/em>\u201d. That\u2019s precisely how lots of right now\u2019s AI frameworks really feel. Instruments like GitHub Copilot, Claude Desktop, OpenAI Operator, and Perplexity Comet are automating on a regular basis duties that will\u2019ve appeared not possible to automate simply 5 years in the past. What\u2019s much more outstanding is that with just some traces of code, we are able to construct our personal subtle AI instruments: ones that search via information, browse the online, click on hyperlinks, and even make purchases. It actually does really feel like magic.<\/p>\n<p class=\"wp-block-paragraph\">Although I <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/i-think-of-analysts-as-data-wizards-who-help-their-product-teams-solve-problems\/\" rel=\"noreferrer noopener\" target=\"_blank\">genuinely imagine in knowledge wizards<\/a>, I don\u2019t imagine in magic. I discover it thrilling (and sometimes useful) to know how issues are literally constructed and what\u2019s taking place below the hood. That\u2019s why I\u2019ve determined to share a collection of posts on agentic AI design ideas that\u2019ll make it easier to perceive how all these magical instruments truly work.<\/p>\n<p class=\"wp-block-paragraph\">To achieve a deep understanding, we\u2019ll construct a multi-AI agent system from scratch. 
We\u2019ll avoid using frameworks like CrewAI or smolagents and instead work directly with the foundation model API. Along the way, we\u2019ll explore the fundamental agentic design patterns: reflection, tool use, planning, and multi-agent setups. Then, we\u2019ll combine all this knowledge to build a multi-AI agent system that can answer complex data-related questions.<\/p>\n<p class=\"wp-block-paragraph\">As Richard Feynman put it, \u201c<em>What I cannot create, I do not understand<\/em>.\u201d So let\u2019s start building! In this article, we\u2019ll focus on the reflection design pattern. But first, let\u2019s figure out what exactly reflection is.<\/p>\n<h2 class=\"wp-block-heading\">What reflection is<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s reflect on how we (humans) usually work on tasks. Imagine I need to share the results of a recent feature launch with my PM. I\u2019ll likely put together a quick draft and then read it once or twice from beginning to end, making sure that all parts are consistent, there\u2019s enough information, and there are no typos.<\/p>\n<p class=\"wp-block-paragraph\">Or let\u2019s take another example: writing a SQL query. I\u2019ll either write it step by step, checking the intermediate results along the way, or (if it\u2019s simple enough) I\u2019ll draft it all at once, execute it, look at the result (checking for errors or whether the result matches my expectations), and then tweak the query based on that feedback. I\u2019d rerun it, check the result, and iterate until it\u2019s right.<\/p>\n<p class=\"wp-block-paragraph\">So we rarely write long texts from top to bottom in one go. We usually circle back, review, and tweak as we go. 
These feedback loops are what help us improve the quality of our work.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-23-at-08.57.53-1024x789.png\" alt=\"\" class=\"wp-image-627319\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">LLMs use a different approach. When you ask an LLM a question, by default, it\u2019ll generate the answer token by token, and the LLM won\u2019t be able to review its result and fix any issues. But in an agentic AI setup, we can create feedback loops for LLMs too, either by asking the LLM to review and improve its own answer or by sharing external feedback with it (like the results of a SQL execution). And that\u2019s the whole point of reflection. It sounds quite straightforward, but it can yield significantly better results.<\/p>\n<p class=\"wp-block-paragraph\">There\u2019s a substantial body of research showing the benefits of reflection:<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*fZCBw-xDGLRiQaOSGKznQw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image from \u201c<a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2303.17651\">Self-Refine: Iterative Refinement with Self-Feedback<\/a>,\u201d Madaan et\u00a0al.\u00a0<\/figcaption><\/figure>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">In <strong>\u201c<\/strong><a rel=\"nofollow\" target=\"_blank\" 
href=\"https:\/\/arxiv.org\/abs\/2303.11366?utm_campaign=The%20Batch&amp;utm_source=hs_email&amp;utm_medium=email&amp;_hsenc=p2ANqtz-9dHVnW1I1bA3sPBbsikjT165Qez3QiiAssknCERwgki818YHG7PyHOQSgg-nxKDa0BuE7B\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Reflexion: Language Brokers with Verbal Reinforcement Studying<\/strong><\/a><strong>\u201d<\/strong> Shinn et al. (2023), the authors achieved a 91% move@1 accuracy on the HumanEval coding benchmark, surpassing the earlier state-of-the-art GPT-4, which scored simply 80%. In addition they discovered that Reflexion considerably outperforms all baseline approaches on the HotPotQA benchmark (a Wikipedia-based Q&amp;A dataset that challenges brokers to parse content material and cause over a number of supporting paperwork).<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*UbEzF5n1PZbD8oRLJW3mMw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Picture from \u201c<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2303.11366?utm_campaign=The%20Batch&amp;utm_source=hs_email&amp;utm_medium=email&amp;_hsenc=p2ANqtz-9dHVnW1I1bA3sPBbsikjT165Qez3QiiAssknCERwgki818YHG7PyHOQSgg-nxKDa0BuE7B\" rel=\"noreferrer noopener\" target=\"_blank\">Reflexion: Language Brokers with Verbal Reinforcement Studying<\/a>,\u201d Shinn et\u00a0al.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Reflection is very impactful in agentic techniques as a result of it may be used to course-correct at many steps of the method:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">When a consumer asks a query, the LLM can use reflection to guage whether or not the request is possible.<\/li>\n<li class=\"wp-block-list-item\">When the LLM places collectively an preliminary plan, it could use reflection to double-check whether or not the plan is sensible and may also help obtain the purpose.<\/li>\n<li 
class=\"wp-block-list-item\">After every execution step or device name, the agent can consider whether or not it\u2019s on monitor and whether or not it\u2019s value adjusting the plan.<\/li>\n<li class=\"wp-block-list-item\">When the plan is absolutely executed, the agent can mirror to see whether or not it has truly achieved the purpose and solved the duty.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">It\u2019s clear that reflection can considerably enhance accuracy. Nonetheless, there are trade-offs value discussing. Reflection may require a number of further calls to the LLM and probably different techniques, which might result in elevated latency and prices. So in enterprise circumstances, it\u2019s value contemplating whether or not the standard enhancements justify the bills and delays within the consumer circulation.<\/p>\n<h2 class=\"wp-block-heading\">Reflection in frameworks<\/h2>\n<p class=\"wp-block-paragraph\">Since there\u2019s little doubt that reflection brings worth to AI brokers, it\u2019s broadly utilized in fashionable frameworks. Let\u2019s have a look at some examples.<\/p>\n<p class=\"wp-block-paragraph\">The concept of reflection was first proposed within the paper <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2210.03629\" rel=\"noreferrer noopener\" target=\"_blank\">\u201cReAct: Synergizing Reasoning and Appearing in Language Fashions\u201d<\/a> by Yao et al. (2022). ReAct is a framework that mixes interleaving levels of Reasoning (reflection via specific thought traces) and Appearing (task-relevant actions in an surroundings). On this framework, reasoning guides the selection of actions, and actions produce new observations that inform additional reasoning. 
The reasoning stage itself is a combination of reflection and planning.<\/p>\n<p class=\"wp-block-paragraph\">This framework became quite popular, so there are now several off-the-shelf implementations, such as:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/dspy.ai\/api\/modules\/ReAct\/\"><strong>DSPy<\/strong><\/a> framework by Databricks has a <code>ReAct<\/code> class,<\/li>\n<li class=\"wp-block-list-item\">In <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/langchain-ai.github.io\/langgraph\/agents\/agents\/#1-install-dependencies\"><strong>LangGraph<\/strong><\/a>, you can use the <code>create_react_agent<\/code> function,<\/li>\n<li class=\"wp-block-list-item\">Code agents in the <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/huggingface.co\/docs\/smolagents\/conceptual_guides\/react\"><strong>smolagents<\/strong><\/a> library by HuggingFace are also based on the ReAct architecture.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Reflection from\u00a0scratch<\/h2>\n<p class=\"wp-block-paragraph\">Now that we\u2019ve covered the theory and explored existing implementations, it\u2019s time to get our hands dirty and build something ourselves. In the ReAct approach, agents use reflection at every step, combining planning with reflection. However, to understand the impact of reflection more clearly, we\u2019ll look at it in isolation.<\/p>\n<p class=\"wp-block-paragraph\">As an example, we\u2019ll use text-to-SQL: we\u2019ll give an LLM a question and expect it to return a valid SQL query. 
We\u2019ll be working with <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/datasets\/hrishitpatil\/flight-data-2024\">a flight delay dataset<\/a> and the ClickHouse SQL dialect.<\/p>\n<p class=\"wp-block-paragraph\">We\u2019ll start by using direct generation without any reflection as our baseline. Then, we\u2019ll try using reflection by asking the model to critique and improve the SQL, or by providing it with additional feedback. After that, we\u2019ll measure the quality of our answers to see whether reflection actually leads to better results.<\/p>\n<h3 class=\"wp-block-heading\">Direct generation<\/h3>\n<p class=\"wp-block-paragraph\">We\u2019ll begin with the most straightforward approach, direct generation, where we ask the LLM to generate SQL that answers a user query.<\/p>\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"el1761208924193\"><code class=\"language-bash\">pip install anthropic<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We need to specify the API key for the Anthropic API.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os\nos.environ['ANTHROPIC_API_KEY'] = config['ANTHROPIC_API_KEY']<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to initialise the client, and we\u2019re all set.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import anthropic\nclient = anthropic.Anthropic()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now we can use this client to send messages to the LLM. Let\u2019s put together a function to generate SQL based on a user query. I\u2019ve specified the system prompt with basic instructions and detailed information about the data schema. 
I\u2019ve also created a function to send the system prompt and user query to the LLM.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">base_sql_system_prompt = '''\nYou are a senior SQL developer and your task is to help generate a SQL query based on user requirements. \nYou are working with a ClickHouse database. Specify the format (TabSeparatedWithNames) in the SQL query output to ensure that column names are included in the output.\nDo not use count(*) in your queries since it is a bad practice with columnar databases; prefer using count().\nMake sure the query is syntactically correct and optimized for performance, taking into account ClickHouse-specific features (i.e. that ClickHouse is a columnar database and supports functions like ARRAY JOIN, SAMPLE, etc.).\nReturn only the SQL query without any additional explanations or comments.\n\nYou will be working with the flight_data table, which has the following schema:\n\nColumn Name | Data Type | Null % | Example Value | Description\n--- | --- | --- | --- | ---\nyear | Int64 | 0.0 | 2024 | Year of flight\nmonth | Int64 | 0.0 | 1 | Month of flight (1\u201312)\nday_of_month | Int64 | 0.0 | 1 | Day of the month\nday_of_week | Int64 | 0.0 | 1 | Day of week (1=Monday \u2026 7=Sunday)\nfl_date | datetime64[ns] | 0.0 | 2024-01-01 00:00:00 | Flight date (YYYY-MM-DD)\nop_unique_carrier | object | 0.0 | 9E | Unique carrier code\nop_carrier_fl_num | float64 | 0.0 | 4814.0 | Flight number for reporting airline\norigin | object | 0.0 | JFK | Origin airport code\norigin_city_name | object | 0.0 | \"New York, NY\" | Origin city name\norigin_state_nm | object | 0.0 | New York | Origin state name\ndest | object | 0.0 | DTW | Destination airport code\ndest_city_name | object | 0.0 | \"Detroit, MI\" | Destination city name\ndest_state_nm | object | 0.0 | Michigan | Destination state name\ncrs_dep_time | Int64 | 0.0 | 1252 | Scheduled departure time (local, hhmm)\ndep_time | float64 | 1.31 | 1247.0 | Actual departure time (local, hhmm)\ndep_delay | float64 | 1.31 | -5.0 | Departure delay in minutes (negative if early)\ntaxi_out | float64 | 1.35 | 31.0 | Taxi-out time in minutes\nwheels_off | float64 | 1.35 | 1318.0 | Wheels-off time (local, hhmm)\nwheels_on | float64 | 1.38 | 1442.0 | Wheels-on time (local, hhmm)\ntaxi_in | float64 | 1.38 | 7.0 | Taxi-in time in minutes\ncrs_arr_time | Int64 | 0.0 | 1508 | Scheduled arrival time (local, hhmm)\narr_time | float64 | 1.38 | 1449.0 | Actual arrival time (local, hhmm)\narr_delay | float64 | 1.61 | -19.0 | Arrival delay in minutes (negative if early)\ncancelled | int64 | 0.0 | 0 | Cancelled flight indicator (0=No, 1=Yes)\ncancellation_code | object | 98.64 | B | Reason for cancellation (if cancelled)\ndiverted | int64 | 0.0 | 0 | Diverted flight indicator (0=No, 1=Yes)\ncrs_elapsed_time | float64 | 0.0 | 136.0 | Scheduled elapsed time in minutes\nactual_elapsed_time | float64 | 1.61 | 122.0 | Actual elapsed time in minutes\nair_time | float64 | 1.61 | 84.0 | Flight time in minutes\ndistance | float64 | 0.0 | 509.0 | Distance between origin and destination (miles)\ncarrier_delay | int64 | 0.0 | 0 | Carrier-related delay in minutes\nweather_delay | int64 | 0.0 | 0 | Weather-related delay in minutes\nnas_delay | int64 | 0.0 | 0 | National Air System delay in minutes\nsecurity_delay | int64 | 0.0 | 0 | Security delay in minutes\nlate_aircraft_delay | int64 | 0.0 | 0 | Late aircraft delay in minutes\n'''\n\ndef generate_direct_sql(rec):\n  # making an LLM call\n  message = client.messages.create(\n    model = \"claude-3-5-haiku-latest\",\n    # I chose a smaller model so that it's easier for us to see the impact \n    max_tokens = 8192,\n    system = base_sql_system_prompt,\n    messages = [\n        {'role': 'user', 'content': rec['question']}\n    ]\n  )\n\n  sql = message.content[0].text\n  \n  # cleaning the output\n  if sql.endswith('```'):\n    sql = sql[:-3]\n  if sql.startswith('```sql'):\n    sql = sql[6:]\n  return sql<\/code><\/pre>\n<p class=\"wp-block-paragraph\">That\u2019s it. Now let\u2019s test our text-to-SQL solution. I\u2019ve created a <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/blob\/main\/ai_under_the_hood\/data\/flight_data_qa_pairs.json\">small evaluation set<\/a> of 20 question-and-answer pairs that we can use to check whether our system is working well. Here\u2019s one example:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">{\n'question': 'What was the highest speed in mph?',\n'answer': '''\n    select max(distance \/ (air_time \/ 60)) as max_speed \n    from flight_data \n    where air_time &gt; 0 \n    format TabSeparatedWithNames'''\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s use our text-to-SQL function to generate SQL for all user queries in the test set.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import json\n\n# load the evaluation set\nwith open('.\/data\/flight_data_qa_pairs.json', 'r') as f:\n    qa_pairs = json.load(f)\nqa_pairs_df = pd.DataFrame(qa_pairs)\n\ntmp = []\n# running the LLM for each question in our eval set\nfor rec in tqdm.tqdm(qa_pairs_df.to_dict('records')):\n    llm_sql = generate_direct_sql(rec)\n    tmp.append(\n        {\n            'id': rec['id'],\n            'llm_direct_sql': llm_sql\n        }\n    )\n\nllm_direct_df = pd.DataFrame(tmp)\ndirect_result_df = qa_pairs_df.merge(llm_direct_df, on = 'id')<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now we have our answers, and the next step is to measure the quality.<\/p>\n<h3 class=\"wp-block-heading\">Measuring quality<\/h3>\n<p class=\"wp-block-paragraph\">Unfortunately, 
there\u2019s no single correct answer in this scenario, so we can\u2019t just compare the SQL generated by the LLM to a reference answer. We need to come up with a way to measure quality.<\/p>\n<p class=\"wp-block-paragraph\">There are some aspects of quality that we can check with objective criteria, but to check whether the LLM returned the right answer, we\u2019ll need to use an LLM. So I\u2019ll use a combination of approaches:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">First, we\u2019ll use objective criteria to check whether the correct format was specified in the SQL (we instructed the LLM to use <code>TabSeparatedWithNames<\/code>).<\/li>\n<li class=\"wp-block-list-item\">Second, we can execute the generated query and see whether ClickHouse returns an execution error.<\/li>\n<li class=\"wp-block-list-item\">Finally, we can create an LLM judge that compares the output of the generated query to our reference answer and checks whether they differ.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Let\u2019s start by executing the SQL. It\u2019s worth noting that our <code>get_clickhouse_data<\/code> function doesn\u2019t throw an exception. 
Instead, it returns text explaining the error, which can be handled by the LLM later.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">CH_HOST = 'http:\/\/localhost:8123' # default address \nimport requests\nimport pandas as pd\nimport tqdm\n\n# function to execute a SQL query\ndef get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):\n  r = requests.post(host, params = {'query': query}, \n    timeout = connection_timeout)\n  if r.status_code == 200:\n      return r.text\n  else: \n      return 'Database returned the following error:\\n' + r.text\n\n# getting the results of SQL execution\ndirect_result_df['llm_direct_output'] = direct_result_df['llm_direct_sql'].apply(get_clickhouse_data)\ndirect_result_df['answer_output'] = direct_result_df['answer'].apply(get_clickhouse_data)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The next step is to create an LLM judge. For this, I\u2019m using a chain\u2011of\u2011thought approach that prompts the LLM to provide its reasoning before giving the final answer. This gives the model time to think through the problem, which improves response quality.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">llm_judge_system_prompt = '''\nYou are a senior analyst and your task is to compare two SQL query results and determine whether they are equivalent. \nFocus only on the data returned by the queries, ignoring any formatting differences. \nPay attention to the initial user query and the information needed to answer it. 
For example, if the user asked for the average distance, and both queries return the same average value but one of them also includes a count of records, you should consider them equivalent, since both provide the same requested information.\n\nAnswer with a JSON of the following structure:\n{\n  'reasoning': '&lt;your reasoning here: a few sentences on why you think they are equivalent or not&gt;', \n  'equivalence': &lt;true\/false&gt;\n}\nMake sure ONLY JSON is in the output. \n\nYou will be working with the flight_data table, which has the following schema:\nColumn Name | Data Type | Null % | Example Value | Description\n--- | --- | --- | --- | ---\nyear | Int64 | 0.0 | 2024 | Year of flight\nmonth | Int64 | 0.0 | 1 | Month of flight (1\u201312)\nday_of_month | Int64 | 0.0 | 1 | Day of the month\nday_of_week | Int64 | 0.0 | 1 | Day of week (1=Monday \u2026 7=Sunday)\nfl_date | datetime64[ns] | 0.0 | 2024-01-01 00:00:00 | Flight date (YYYY-MM-DD)\nop_unique_carrier | object | 0.0 | 9E | Unique carrier code\nop_carrier_fl_num | float64 | 0.0 | 4814.0 | Flight number for reporting airline\norigin | object | 0.0 | JFK | Origin airport code\norigin_city_name | object | 0.0 | \"New York, NY\" | Origin city name\norigin_state_nm | object | 0.0 | New York | Origin state name\ndest | object | 0.0 | DTW | Destination airport code\ndest_city_name | object | 0.0 | \"Detroit, MI\" | Destination city name\ndest_state_nm | object | 0.0 | Michigan | Destination state name\ncrs_dep_time | Int64 | 0.0 | 1252 | Scheduled departure time (local, hhmm)\ndep_time | float64 | 1.31 | 1247.0 | Actual departure time (local, hhmm)\ndep_delay | float64 | 1.31 | -5.0 | Departure delay in minutes (negative if early)\ntaxi_out | float64 | 1.35 | 31.0 | Taxi-out time in minutes\nwheels_off | float64 | 1.35 | 1318.0 | Wheels-off time (local, hhmm)\nwheels_on | float64 | 1.38 | 1442.0 | Wheels-on time (local, hhmm)\ntaxi_in | float64 | 1.38 | 7.0 | Taxi-in time in minutes\ncrs_arr_time | Int64 | 0.0 | 1508 | Scheduled arrival time (local, hhmm)\narr_time | float64 | 1.38 | 1449.0 | Actual arrival time (local, hhmm)\narr_delay | float64 | 1.61 | -19.0 | Arrival delay in minutes (negative if early)\ncancelled | int64 | 0.0 | 0 | Cancelled flight indicator (0=No, 1=Yes)\ncancellation_code | object | 98.64 | B | Reason for cancellation (if cancelled)\ndiverted | int64 | 0.0 | 0 | Diverted flight indicator (0=No, 1=Yes)\ncrs_elapsed_time | float64 | 0.0 | 136.0 | Scheduled elapsed time in minutes\nactual_elapsed_time | float64 | 1.61 | 122.0 | Actual elapsed time in minutes\nair_time | float64 | 1.61 | 84.0 | Flight time in minutes\ndistance | float64 | 0.0 | 509.0 | Distance between origin and destination (miles)\ncarrier_delay | int64 | 0.0 | 0 | Carrier-related delay in minutes\nweather_delay | int64 | 0.0 | 0 | Weather-related delay in minutes\nnas_delay | int64 | 0.0 | 0 | National Air System delay in minutes\nsecurity_delay | int64 | 0.0 | 0 | Security delay in minutes\nlate_aircraft_delay | int64 | 0.0 | 0 | Late aircraft delay in minutes\n'''\n\nllm_judge_user_prompt_template = '''\nHere is the initial user query:\n{user_query}\n\nHere is the SQL query generated by the first analyst: \nSQL: \n{sql1} \n\nDatabase output: \n{result1}\n\nHere is the SQL query generated by the second analyst:\nSQL:\n{sql2}\n\nDatabase output:\n{result2}\n'''\n\ndef llm_judge(rec, field_to_check):\n  # construct the user prompt \n  user_prompt = llm_judge_user_prompt_template.format(\n    user_query = rec['question'],\n    sql1 = rec['answer'],\n    result1 = rec['answer_output'],\n    sql2 = rec[field_to_check + '_sql'],\n    result2 = rec[field_to_check + '_output']\n  )\n  \n  # make an LLM call\n  message = client.messages.create(\n      model = \"claude-sonnet-4-5\",\n      max_tokens = 8192,\n      
temperature = 0.1,\n      system = llm_judge_system_prompt,\n      messages=[\n          {'role': 'user', 'content': user_prompt}\n      ]\n  )\n  data = message.content[0].text\n  \n  # strip markdown code blocks\n  data = data.strip()\n  if data.startswith('```json'):\n      data = data[7:]\n  elif data.startswith('```'):\n      data = data[3:]\n  if data.endswith('```'):\n      data = data[:-3]\n  \n  data = data.strip()\n  return json.loads(data)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now, let\u2019s run the LLM judge to get the results.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">tmp = []\n\nfor rec in tqdm.tqdm(direct_result_df.to_dict('records')):\n  try:\n    judgment = llm_judge(rec, 'llm_direct')\n  except Exception as e:\n    print(f\"Error processing record {rec['id']}: {e}\")\n    continue\n  tmp.append(\n    {\n      'id': rec['id'],\n      'llm_judge_reasoning': judgment['reasoning'],\n      'llm_judge_equivalence': judgment['equivalence']\n    }\n  )\n\njudge_df = pd.DataFrame(tmp)\ndirect_result_df = direct_result_df.merge(judge_df, on = 'id')<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s look at one example to see how the LLM judge works.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># user query \nIn 2024, what share of time did all airplanes spend in the air?\n\n# correct answer \nselect (sum(air_time) \/ sum(actual_elapsed_time)) * 100 as percentage_in_air \nfrom flight_data \nwhere year = 2024\nformat TabSeparatedWithNames\n\npercentage_in_air\n81.43582596894757\n\n# LLM-generated answer \nSELECT \n    round(sum(air_time) \/ (sum(air_time) + sum(taxi_out) + sum(taxi_in)) * 100, 2) as air_time_percentage\nFROM flight_data\nWHERE year = 2024\nFORMAT TabSeparatedWithNames\n\nair_time_percentage\n81.39\n\n# LLM 
judge response\n{\n 'reasoning': 'Both queries calculate the percentage of time airplanes \n    spent in the air, but use different denominators. The first query \n    uses actual_elapsed_time (which includes air_time + taxi_out + taxi_in \n    + any ground delays), while the second uses only (air_time + taxi_out \n    + taxi_in). The second query\u2019s approach is more accurate for answering \n    \"time airplanes spent in the air\" since it excludes ground delays. \n    However, the results are very close (81.44% vs 81.39%), suggesting minimal \n    impact. These are materially different approaches that happen to yield \n    similar results',\n 'equivalence': false\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The reasoning makes sense, so we can trust our judge. Now, let\u2019s check all the LLM-generated queries.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_llm_accuracy(sql, output, equivalence): \n    problems = []\n    if 'format tabseparatedwithnames' not in sql.lower():\n        problems.append('No format specified in SQL')\n    if 'Database returned the following error' in output:\n        problems.append('SQL execution error')\n    if not equivalence and ('SQL execution error' not in problems):\n        problems.append('Wrong answer provided')\n    if len(problems) == 0:\n        return 'No problems detected'\n    else:\n        return ' + '.join(problems)\n\ndirect_result_df['llm_direct_sql_quality_heuristics'] = direct_result_df.apply(\n    lambda row: get_llm_accuracy(row['llm_direct_sql'], row['llm_direct_output'], row['llm_judge_equivalence']), axis=1)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The LLM returned the correct answer in 70% of cases, which isn\u2019t bad. 
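As an illustration of how such per-query labels roll up into an accuracy number, here\u2019s a toy sketch with hand-made labels standing in for the real quality column produced above:

```python
import pandas as pd

# Hand-made labels mimicking the get_llm_accuracy output for five queries
results = pd.DataFrame({'quality': [
    'No problems detected',
    'No problems detected',
    'Wrong answer provided',
    'No format specified in SQL + SQL execution error',
    'No problems detected',
]})

# Share of queries per outcome; accuracy = share of 'No problems detected'
summary = results['quality'].value_counts(normalize=True)
accuracy = summary.get('No problems detected', 0.0)
```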
But there\u2019s definitely room for improvement, since it often either gives the wrong answer or fails to specify the format correctly (sometimes causing SQL execution errors).<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*oUityyRHgCEMldnKQAFF0Q.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Adding a reflection step<\/h3>\n<p class=\"wp-block-paragraph\">To improve the quality of our solution, let\u2019s try adding a reflection step where we ask the model to review and refine its answer.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">For the reflection call, I\u2019ll keep the same system prompt since it contains all the necessary information about SQL and the data schema. But I\u2019ll tweak the user message to share the initial user query and the generated SQL, asking the LLM to critique and improve it.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">simple_reflection_user_prompt_template = '''\nYour task is to review the SQL query generated by another analyst and suggest improvements if necessary.\nCheck whether the query is syntactically correct and optimized for performance. \nPay attention to nuances in the data (especially timestamp types, whether to use total elapsed time or time in the air, etc.).\nMake sure the query answers the initial user question accurately. \nAs the result, return the following JSON: \n{{\n  'reasoning': '&lt;your reasoning here: a few sentences on why you made changes or not&gt;', \n  'refined_sql': '&lt;the improved SQL query here&gt;'\n}}\nMake sure ONLY JSON is in the output and nothing else. Make sure the output JSON is valid. 
\n\nHere is the initial user query:\n{user_query}\n\nHere is the SQL query generated by another analyst: \n{sql} \n'''\n\ndef simple_reflection(rec) -&gt; dict:\n  # constructing the user prompt\n  user_prompt = simple_reflection_user_prompt_template.format(\n    user_query=rec['question'],\n    sql=rec['llm_direct_sql']\n  )\n  \n  # making an LLM call\n  message = client.messages.create(\n    model=\"claude-3-5-haiku-latest\",\n    max_tokens = 8192,\n    system=base_sql_system_prompt,\n    messages=[\n        {'role': 'user', 'content': user_prompt}\n    ]\n  )\n\n  data = message.content[0].text\n\n  # strip markdown code blocks\n  data = data.strip()\n  if data.startswith('```json'):\n    data = data[7:]\n  elif data.startswith('```'):\n    data = data[3:]\n  if data.endswith('```'):\n    data = data[:-3]\n  \n  data = data.strip()\n  return json.loads(data.replace('\\n', ' '))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s refine the queries with reflection and measure the accuracy. We don\u2019t see much improvement in the final quality. We\u2019re still at 70% correct answers.<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*fBn9HoJJHtWnZE7l2MqdFg.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s look at specific examples to understand what happened. 
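For reference, the refinement pass itself is applied with the same loop shape we used for direct generation; here\u2019s a sketch with simple_reflection stubbed out so it runs standalone:

```python
import pandas as pd

# Stub standing in for the simple_reflection LLM call defined above
def simple_reflection(rec):
    return {'reasoning': 'query already correct', 'refined_sql': rec['llm_direct_sql']}

# Toy stand-in for the direct-generation results
direct_result_df = pd.DataFrame([
    {'id': 1, 'question': 'What was the highest speed in mph?', 'llm_direct_sql': 'select 1'},
])

# run the reflection step for every record and merge the refined SQL back in
tmp = []
for rec in direct_result_df.to_dict('records'):
    reflection = simple_reflection(rec)
    tmp.append({'id': rec['id'], 'llm_reflection_sql': reflection['refined_sql']})

reflection_df = direct_result_df.merge(pd.DataFrame(tmp), on='id')
```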
First, there are a couple of cases where the LLM managed to fix the problem, either by correcting the format or by adding missing logic to handle zero values.<\/p>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2400\/1*vsoqKVhiOEx9VhEbEle1lg.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">However, there are also cases where the LLM overcomplicated the answer. The initial SQL was correct (matching the golden set answer), but then the LLM decided to \u2018improve\u2019 it. Some of these improvements are reasonable (e.g., accounting for nulls or excluding cancelled flights). However, for some reason, it decided to use ClickHouse sampling, even though we don\u2019t have much data and our table doesn\u2019t support sampling. As a result, the refined query returned an execution error: <code>Database returned the following error: Code: 141. DB::Exception: Storage default.flight_data doesn't support sampling. (SAMPLING_NOT_SUPPORTED)<\/code>.<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*JFsKJgR3O7tnrszxQrOk7Q.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Reflection with external\u00a0feedback <\/h3>\n<p class=\"wp-block-paragraph\">Reflection didn\u2019t improve accuracy much. This is likely because we didn\u2019t provide any additional information that would help the model generate a better result. 
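The simplest source of such information is to actually run the generated query and capture whatever the database returns. Below is a minimal, hypothetical sketch of that idea: the <code>execute<\/code> callable is an assumption standing in for a real database client (e.g., a ClickHouse request), not the article\u2019s actual helper.

```python
# Hedged sketch: turn query execution into textual feedback for reflection.
# `execute` is a placeholder for any real database client call;
# it is an assumption, not the article's actual code.
def collect_db_feedback(sql: str, execute) -> str:
    """Run the query and return its output (or the error text) as feedback."""
    try:
        result = execute(sql)
        return f"Query executed successfully. Output:\n{result}"
    except Exception as error:
        # the error message itself is valuable feedback for the LLM
        return f"Database returned the following error: {error}"
```

Both branches produce plain text that can be passed straight into the reflection prompt. 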
Let\u2019s try sharing external feedback with the model:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The result of our check on whether the format is specified correctly<\/li>\n<li class=\"wp-block-list-item\">The output from the database (either data or an error message)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Let\u2019s put together a prompt for this and generate a new version of the SQL.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">feedback_reflection_user_prompt_template = '''\nYour task is to review the SQL query generated by another analyst and suggest improvements if necessary.\nCheck whether the query is syntactically correct and optimized for performance. \nPay attention to nuances in data (especially timestamp types, whether to use total elapsed time or time in the air, etc.).\nMake sure the query answers the initial user question accurately. \n\nAs the result return the following JSON: \n{{\n  'reasoning': '&lt;your reasoning here: a couple of sentences on why you made changes or not&gt;', \n  'refined_sql': '&lt;the improved SQL query here&gt;'\n}}\nMake sure ONLY JSON is in the output and nothing else. Make sure the output JSON is valid. \n\nHere is the initial user query:\n{user_query}\n\nHere is the SQL query generated by another analyst: \n{sql} \n\nHere is the database output of this query: \n{output}\n\nWe run an automated check on the SQL query to see whether it has formatting issues. Here's the output: \n{formatting}\n'''\n\ndef feedback_reflection(rec) -&gt; str:\n  # define the formatting feedback message \n  if 'No format specified in SQL' in rec['llm_direct_sql_quality_heuristics']:\n    formatting = 'SQL missing formatting. Specify \"format TabSeparatedWithNames\" to ensure that column names are also returned'\n  else: \n    formatting = 'Formatting is correct'\n\n  # constructing a user prompt\n  user_prompt = feedback_reflection_user_prompt_template.format(\n    user_query = rec['question'],\n    sql = rec['llm_direct_sql'],\n    output = rec['llm_direct_output'],\n    formatting = formatting\n  )\n\n  # making an LLM call \n  message = client.messages.create(\n    model = \"claude-3-5-haiku-latest\",\n    max_tokens = 8192,\n    system = base_sql_system_prompt,\n    messages = [\n        {'role': 'user', 'content': user_prompt}\n    ]\n  )\n  data = message.content[0].text\n\n  # strip markdown code blocks\n  data = data.strip()\n  if data.startswith('```json'):\n    data = data[7:]\n  elif data.startswith('```'):\n    data = data[3:]\n  if data.endswith('```'):\n    data = data[:-3]\n  \n  data = data.strip()\n  return json.loads(data.replace('\\n', ' '))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After running our accuracy measurements, we can see that accuracy has improved considerably: 17 correct answers (85% accuracy) compared to 14 (70% accuracy).<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*XlUBC2ByIuDBQa2phJNNWQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">If we check the cases where the LLM fixed the issues, we can see that it was able to correct the format, address SQL execution errors, and even revise the business logic (e.g., using air time for calculating speed).<\/p>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" 
src=\"https:\/\/cdn-images-1.medium.com\/max\/2400\/1*6_2AfwAxxEBZIkmih0UBxg.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s also do some error analysis to examine the cases where the LLM made mistakes. In the table below, we can see that the LLM struggled with defining certain timestamps, incorrectly calculating total time, or using total time instead of air time for speed calculations. However, some of the discrepancies are a bit tricky:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">In the last query, the time period wasn\u2019t explicitly defined, so it\u2019s reasonable for the LLM to use 2010\u20132023. I wouldn\u2019t consider this an error, and I\u2019d adjust the evaluation instead.<\/li>\n<li class=\"wp-block-list-item\">Another example is how to define airline speed: <code>avg(distance\/time)<\/code> or <code>sum(distance)\/sum(time)<\/code>. Both options are valid since nothing was specified in the user query or system prompt (assuming we don\u2019t have a predefined calculation method).<\/li>\n<\/ul>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2400\/1*wCjbvibivJgbBrdAxRI1zw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Overall, I think we achieved a pretty good result. Our final 85% accuracy represents a significant 15-percentage-point improvement. 
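The single reflection pass can also be wrapped in a loop that stops once the external feedback is clean. Here is a minimal sketch under stated assumptions: <code>reflect<\/code> and <code>has_execution_error<\/code> are hypothetical stand-ins for the LLM refinement call and the database check, not the exact helpers used above.

```python
# Hedged sketch: multi-round reflection with a simple early-stopping rule.
# `reflect` and `has_execution_error` are hypothetical stand-ins for the
# LLM refinement call and the external feedback check.
def iterative_reflection(sql, reflect, has_execution_error, max_rounds=3):
    """Refine the query until it runs cleanly or the round limit is hit."""
    for _ in range(max_rounds):
        if not has_execution_error(sql):
            break  # good enough: stop early to save tokens and latency
        sql = reflect(sql)  # ask the LLM to critique and refine the query
    return sql
```

With this shape, each extra round only happens when the previous answer still fails a check. 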
You could potentially go beyond one iteration and run 2\u20133 rounds of reflection, but it\u2019s worth assessing when you hit diminishing returns in your particular case, since each iteration comes with increased cost and latency.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>You can find the full code on <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/github.com\/miptgirl\/miptgirl_medium\/tree\/main\/ai_under_the_hood\">GitHub<\/a>.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n<p class=\"wp-block-paragraph\">It\u2019s time to wrap things up. In this article, we started our journey into understanding how the magic of agentic AI systems works. To figure it out, we\u2019ll implement a multi-agent text-to-data tool using only API calls to foundation models. Along the way, we\u2019ll walk through the key design patterns step by step: starting today with reflection, and moving on to tool use, planning, and multi-agent coordination.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In this article, we started with the most fundamental pattern: reflection. Reflection is at the core of any agentic flow, since the LLM needs to reflect on its progress toward achieving the end goal.<\/p>\n<p class=\"wp-block-paragraph\">Reflection is a relatively straightforward pattern. We simply ask the same or a different model to analyse the result and try to improve it. As we learned in practice, sharing external feedback with the model (like results from static checks or database output) significantly improves accuracy. Multiple research studies and our own experience with the text-to-SQL agent demonstrate the benefits of reflection. 
However, these accuracy gains come at a cost: more tokens spent and higher latency due to multiple API calls.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>Thank you for reading. I hope this article was insightful. Remember Einstein\u2019s advice: \u201cThe important thing is not to stop questioning. Curiosity has its own reason for existing.\u201d May your curiosity lead you to your next great insight.<\/em><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Reference<\/h2>\n<p class=\"wp-block-paragraph\">This article is inspired by the <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/www.deeplearning.ai\/courses\/agentic-ai\/\">\u201cAgentic AI\u201d<\/a> course by Andrew Ng from DeepLearning.AI.<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>says that \u201cany sufficiently superior expertise is indistinguishable from magic\u201d. That\u2019s precisely how lots of right now\u2019s AI frameworks really feel. Instruments like GitHub Copilot, Claude Desktop, OpenAI Operator, and Perplexity Comet are automating on a regular basis duties that will\u2019ve appeared not possible to automate simply 5 years in the past. 
What\u2019s much more [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":8033,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[2105,4397,545],"class_list":["post-8031","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-agentic","tag-principles","tag-reflection"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/8031","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8031"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/8031\/revisions"}],"predecessor-version":[{"id":8032,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/8031\/revisions\/8032"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/8033"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8031"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8031"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8031"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69d9690a190636c2e0989534. 
-->