{"id":2625,"date":"2025-05-19T20:07:20","date_gmt":"2025-05-19T20:07:20","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=2625"},"modified":"2025-05-19T20:07:20","modified_gmt":"2025-05-19T20:07:20","slug":"set-the-variety-of-bushes-in-random-forest","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=2625","title":{"rendered":"Set the Number of Trees in Random Forest"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\">\n<summary>Scientific publication<\/summary>\n<figure class=\"wp-block-pullquote has-subtitle-1-font-size\">\n<blockquote>\n<p>T. M. Lange, M. G\u00fcltas, A. O. Schmitt &amp; F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. <em>BMC bioinformatics<\/em>, 26(1), 95.<\/p>\n<p><cite>Follow this <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/rdcu.be\/efTbn\">LINK<\/a> to the original publication.<\/cite><\/p><\/blockquote>\n<\/figure>\n<\/details>\n<h2 class=\"wp-block-heading\"> Forest \u2014 A Powerful Tool for Anyone Working With Data<\/h2>\n<h3 class=\"wp-block-heading\">What&#8217;s Random Forest?<\/h3>\n<p class=\"wp-block-paragraph\">Have you ever wished you could make better decisions using data \u2014 like predicting the risk of diseases, crop yields, or spotting patterns in customer behaviour? That\u2019s where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.<\/p>\n<p class=\"wp-block-paragraph\">So why is random forest so popular? For one, it\u2019s incredibly versatile. It works well with many types of data, whether numbers, categories, or both. 
It\u2019s also widely used in many fields \u2014 from predicting patient outcomes in healthcare to detecting fraud in finance, from improving shopping experiences online to optimising agricultural practices.<\/p>\n<p class=\"wp-block-paragraph\">Despite the name, random forest has nothing to do with trees in a forest \u2014 but it does use something called <em><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/tag\/decision-trees\/\" title=\"Decision Trees\">Decision Trees<\/a><\/em> to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes\/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the \u201cforest\u201d), each slightly different, and then combines their results to make one final decision. It\u2019s a bit like asking a group of experts for their opinion and then going with the majority vote.<\/p>\n<p class=\"wp-block-paragraph\">But until recently, one question remained unanswered: How many decision trees do I actually need? If each decision tree can lead to different results, averaging many trees should lead to better and more reliable results. But how many are enough? Fortunately, the optRF package answers this question!<\/p>\n<p class=\"wp-block-paragraph\">So let\u2019s have a look at how to optimise <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/tag\/random-forest\/\" title=\"Random Forest\">Random Forest<\/a> for predictions and variable selection!<\/p>\n<h2 class=\"wp-block-heading\">Making Predictions with Random Forests<\/h2>\n<p class=\"wp-block-paragraph\">To optimise and use random forest for making predictions, we can use the open-source statistics programme R. 
Once we open R, we have to install two R packages: \u201cranger\u201d, which allows us to use random forests in R, and \u201coptRF\u201d, which optimises random forests. Both packages are open-source and available via the official R repository CRAN. To install and load these packages, the following lines of R code can be run:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; install.packages(\"ranger\")\n&gt; install.packages(\"optRF\")\n&gt; library(ranger)\n&gt; library(optRF)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now that the packages are installed and loaded, we can use the functions that these packages contain. Furthermore, we can use the data set included in the optRF package, which is free to use under the GPL license (just like the optRF package itself). This data set, called SNPdata, contains in its first column the yield of 250 wheat plants as well as 5,000 genomic markers (so-called single nucleotide polymorphisms, or SNPs) that can take either the value 0 or 2.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; SNPdata[1:5,1:5]\n            Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004\n  ID_001 670.7588        0        0        0        0\n  ID_002 542.5611        0        2        0        0\n  ID_003 591.6631        2        2        0        2\n  ID_004 476.3727        0        0        0        0\n  ID_005 635.9814        2        2        0        2<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This data set is an example of genomic data and can be used for genomic prediction, an important tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of plants using genomic markers. And exactly for this purpose, random forest can be used! 
That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants for which we only have genomic markers.<\/p>\n<p class=\"wp-block-paragraph\">Therefore, let\u2019s imagine that we have 200 wheat plants where we know both the yield and the genomic markers. This is the so-called training data set. Let\u2019s further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we split the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; Training = SNPdata[1:200,]\n&gt; Test = SNPdata[201:250,-1]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">With these data sets, we can now look at how to make predictions using random forests!<\/p>\n<p class=\"wp-block-paragraph\">First, we have to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function <code>opt_prediction<\/code> from the optRF package. Into this function, we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. 
Before we run this function, we can use the set.seed function to ensure reproducibility, even though this is not mandatory (we will see later why reproducibility is an issue here):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; set.seed(123)\n&gt; optRF_result = opt_prediction(y = Training[,1], \n+                               X = Training[,-1], \n+                               X_Test = Test)\n  Recommended number of trees: 19000<\/code><\/pre>\n<p class=\"wp-block-paragraph\">All the results from the <code>opt_prediction<\/code> function are now saved in the object optRF_result; however, the most important information was already printed to the console: For this data set, we should use 19,000 trees.<\/p>\n<p class=\"wp-block-paragraph\">With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Here, too, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we set the <code>write.forest<\/code> argument to TRUE and insert the optimal number of trees in the <code>num.trees<\/code> argument:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; RF_model = ranger(y = Training[,1], x = Training[,-1], \n+                   write.forest = TRUE, num.trees = 19000)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And that\u2019s it! The object <code>RF_model<\/code> contains the random forest model that describes the relationship between the genomic markers and the yield. 
With this model, we can now predict the yield for the 50 plants in the test data set where we have the genomic markers but don\u2019t know the yield:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; predictions = predict(RF_model, data=Test)$predictions\n&gt; predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; head(predicted_Test)\n      ID predicted_yield\n  ID_201        593.6063\n  ID_202        596.8615\n  ID_203        591.3695\n  ID_204        589.3909\n  ID_205        599.5155\n  ID_206        608.1031<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Variable Selection with Random Forests<\/h2>\n<p class=\"wp-block-paragraph\">A different approach to analysing such a data set would be to find out which variables are most important for predicting the response. In this case, the question would be which genomic markers are most important for predicting the yield. This, too, can be done with random forests!<\/p>\n<p class=\"wp-block-paragraph\">For such a task, we don\u2019t need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. 
Since we are interested in calculating the variable importance, we use the function <code>opt_importance<\/code>:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; set.seed(123)\n&gt; optRF_result = opt_importance(y=SNPdata[,1], \n+                               X=SNPdata[,-1])\n  Recommended number of trees: 40000<\/code><\/pre>\n<p class=\"wp-block-paragraph\">One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. With this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before, but we change the number of trees in the num.trees argument to 40,000 and set the importance argument to \u201cpermutation\u201d (other options are \u201cimpurity\u201d and \u201cimpurity_corrected\u201d).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; set.seed(123) \n&gt; RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], \n+                   write.forest = TRUE, num.trees = 40000,\n+                   importance=\"permutation\")\n&gt; D_VI = data.frame(variable = names(SNPdata)[-1], \n+                   importance = RF_model$variable.importance)\n&gt; D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The data frame D_VI now contains all the variables, thus, all the genomic markers, together with their importance. Also, we have directly ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. 
This means that we can have a look at the most important variables using the head function:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; head(D_VI)\n  variable importance\n  SNP_0020   45.75302\n  SNP_0004   38.65594\n  SNP_0019   36.81254\n  SNP_0050   34.56292\n  SNP_0033   30.47347\n  SNP_0043   28.54312<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And that\u2019s it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!<\/p>\n<h2 class=\"wp-block-heading\">Why Do We Need Optimisation?<\/h2>\n<p class=\"wp-block-paragraph\">Now that we\u2019ve seen how easy it is to use random forest and how quickly it can be optimised, it\u2019s time to take a closer look at what\u2019s happening behind the scenes. Specifically, we\u2019ll explore how random forest works and why the results might change from one run to another.<\/p>\n<p class=\"wp-block-paragraph\">To do this, we\u2019ll use random forest to calculate the importance of each genomic marker, but instead of optimising the number of trees beforehand, we\u2019ll stick to the default settings in the ranger function. By default, ranger uses 500 decision trees. 
Let\u2019s try it out:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; set.seed(123) \n&gt; RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], \n+                   write.forest = TRUE, importance=\"permutation\")\n&gt; D_VI = data.frame(variable = names(SNPdata)[-1], \n+                   importance = RF_model$variable.importance)\n&gt; D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]\n&gt; head(D_VI)\n  variable importance\n  SNP_0020   80.22909\n  SNP_0019   60.37387\n  SNP_0043   50.52367\n  SNP_0005   43.47999\n  SNP_0034   38.52494\n  SNP_0015   34.88654<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As expected, everything runs smoothly \u2014 and quickly! In fact, this run was considerably faster than when we previously used 40,000 trees. But what happens if we run the very same code again, this time with a different seed?<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; set.seed(321) \n&gt; RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], \n+                    write.forest = TRUE, importance=\"permutation\")\n&gt; D_VI2 = data.frame(variable = names(SNPdata)[-1], \n+                    importance = RF_model2$variable.importance)\n&gt; D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]\n&gt; head(D_VI2)\n  variable importance\n  SNP_0050   60.64051\n  SNP_0043   58.59175\n  SNP_0033   52.15701\n  SNP_0020   51.10561\n  SNP_0015   34.86162\n  SNP_0019   34.21317<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Once again, everything seems to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That\u2019s a big shift! 
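<\/p>\n<p class=\"wp-block-paragraph\">The size of this shift can also be quantified rather than just eyeballed, for example with a rank correlation between the two importance vectors. The following sketch uses only base R (merge and cor) and assumes the data frames D_VI and D_VI2 from the two runs above are still in the session; a rank correlation well below 1 would confirm that the two runs disagree substantially:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; # Merge the two rankings by marker name; after the merge, columns 2 and 3\n&gt; # hold the importance scores from run 1 and run 2, respectively\n&gt; D_cmp = merge(D_VI, D_VI2, by = \"variable\", suffixes = c(\"_run1\", \"_run2\"))\n&gt; cor(D_cmp[, 2], D_cmp[, 3], method = \"spearman\")<\/code><\/pre>\n<p class=\"wp-block-paragraph\">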
So what changed?<\/p>\n<p class=\"wp-block-paragraph\">The answer lies in something called <em>non-determinism<\/em>. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that results can vary slightly each time you run the algorithm \u2014 even with the very same data set. That\u2019s where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you\u2019re effectively changing the random path the algorithm follows. That\u2019s why, in our example, the most important genomic markers came out differently in each run. This behaviour \u2014 where the same process can yield different results due to internal randomness \u2014 is a classic example of non-determinism in machine learning.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/optimising_random_forest-1024x682.png\" alt=\"Illustration of the relationship between the stability and the number of trees in Random Forest\" class=\"wp-image-604207\"\/><\/figure>\n<p class=\"wp-block-paragraph\">As we just saw, random forest models can produce slightly different results every time you run them, even on the same data, due to the algorithm\u2019s built-in randomness. So, how can we reduce this randomness and make our results more stable?<\/p>\n<p class=\"wp-block-paragraph\">One of the simplest and most effective ways is to increase the number of trees. 
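<\/p>\n<p class=\"wp-block-paragraph\">To see this effect in action, one could repeat the two-seed experiment from above with a much larger forest. The following sketch is not run here (40,000 trees take a while, and the object names RF_a and RF_b are just illustrative), but with a large forest the two importance vectors should agree far more closely than the 500-tree runs did:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; # Same data, two different seeds, but now with a large forest\n&gt; set.seed(123)\n&gt; RF_a = ranger(y=SNPdata[,1], x=SNPdata[,-1],\n+               num.trees = 40000, importance=\"permutation\")\n&gt; set.seed(321)\n&gt; RF_b = ranger(y=SNPdata[,1], x=SNPdata[,-1],\n+               num.trees = 40000, importance=\"permutation\")\n&gt; # A correlation close to 1 indicates stable importance scores\n&gt; cor(RF_a$variable.importance, RF_b$variable.importance)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">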
Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can \u201caverage out\u201d the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000 \u2014 you\u2019re more likely to get a reliable answer from the larger group.<\/p>\n<p class=\"wp-block-paragraph\">With more trees, the model\u2019s predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there\u2019s a catch: more trees also mean more computation time. Training a random forest with 500 trees might take a few seconds, but training one with 40,000 trees can take several minutes or more, depending on the size of your data set and your computer\u2019s performance.<\/p>\n<p class=\"wp-block-paragraph\">Moreover, the relationship between the stability and the computation time of random forest is <em>non-linear<\/em>. While going from 500 to 1,000 trees can considerably improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns \u2014 you pay more in computation time but gain very little in stability. 
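<\/p>\n<p class=\"wp-block-paragraph\">The computation-time side of this trade-off is easy to check on your own machine with base R\u2019s system.time; a small sketch (actual timings will of course depend on your hardware):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; # Elapsed time grows roughly linearly with the number of trees\n&gt; system.time(ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees = 500))\n&gt; system.time(ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees = 5000))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">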
That\u2019s why it\u2019s essential to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.<\/p>\n<p class=\"wp-block-paragraph\">And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees, beyond which adding more trees would unnecessarily increase the computation time.<\/p>\n<p class=\"wp-block-paragraph\">Above, we have already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees, but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. Therefore, we have to insert the name of the optRF object, the measure we are interested in (here, the \u201cimportance\u201d), the interval we want to visualise on the X axis, and whether the recommended number of trees should be added:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">&gt; plot_stability(optRF_result, measure=\"importance\", \n+                from=0, to=50000, add_recommendation=FALSE)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/random_forest_stability_diagram-1024x745.png\" alt=\"R graph that visualises the stability of random forest depending on the number of decision trees\" class=\"wp-image-604208\"\/><figcaption class=\"wp-element-caption\">The output of the plot_stability function visualises the stability of random forest depending on the number of decision trees<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is close to 1 (which indicates perfect stability). Adding more than 40,000 trees would bring the stability even closer to 1, but this gain would be very small while the computation time would increase further. That is why 40,000 trees mark the optimal number for this data set.<\/p>\n<h2 class=\"wp-block-heading\">The Takeaway: Optimise Random Forest to Get the Most Out of It<\/h2>\n<p class=\"wp-block-paragraph\">Random forest is a powerful ally for anyone working with data \u2014 whether you\u2019re a researcher, analyst, student, or data scientist. It\u2019s easy to use, remarkably versatile, and highly effective across a wide range of applications. But like any tool, using it well means understanding what\u2019s happening under the hood. In this post, we\u2019ve uncovered one of its hidden quirks: the randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you\u2019re working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Scientific publication T. M. Lange, M. G\u00fcltas, A. O. Schmitt &amp; F. 
Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC bioinformatics, 26(1), 95. Follow this LINK to the original publication. Forest \u2014 A Powerful Tool for Anyone Working With Data What&#8217;s Random Forest? Have you ever [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2627,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[2555,2552,2554,687,2553],"class_list":["post-2625","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-forest","tag-number","tag-random","tag-set","tag-trees"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/2625","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2625"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/2625\/revisions"}],"predecessor-version":[{"id":2626,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/2625\/revisions\/2626"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/2627"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2625"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2625"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed
.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2625"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}