{"id":13384,"date":"2026-04-03T14:31:37","date_gmt":"2026-04-03T14:31:37","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=13384"},"modified":"2026-04-03T14:31:37","modified_gmt":"2026-04-03T14:31:37","slug":"densenet-paper-walkthrough-all-linked","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=13384","title":{"rendered":"DenseNet Paper Walkthrough: All Linked"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p class=\"wp-block-paragraph\"> we attempt to practice a really deep neural community mannequin, one problem that we&#8217;d encounter is the <em>vanishing gradient<\/em> downside. That is primarily an issue the place the load replace of a mannequin throughout coaching slows down and even stops, therefore inflicting the mannequin to not enhance. When a community could be very deep, the <em>gradient<\/em> computation throughout backpropagation entails multiplying many spinoff phrases collectively by way of the chain rule. Keep in mind that if we multiply small numbers (usually lower than 1) too many instances, it&#8217;ll make the ensuing numbers changing into extraordinarily small. Within the case of neural networks, these numbers are used as the premise of the load replace. So, if the gradient could be very small, then the load replace can be very gradual, inflicting the coaching to be gradual as properly.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">To deal with this vanishing gradient downside, we are able to really use shortcut paths in order that the gradients can move extra simply by way of a deep community. One of the vital fashionable architectures that makes an attempt to resolve that is ResNet, the place it implements skip connections that soar over a number of layers within the community. This concept is adopted by DenseNet, the place the skip connections are carried out way more aggressively, making it higher than ResNet in dealing with the vanishing gradient downside. 
In this article I would like to talk about how exactly DenseNet works and how to implement the architecture from scratch.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">The DenseNet Architecture<\/h2>\n<h3 class=\"wp-block-heading\">Dense Block<\/h3>\n<p class=\"wp-block-paragraph\">DenseNet was originally proposed in a paper titled \u201c<em>Densely Connected Convolutional Networks<\/em>\u201d written by Gao Huang <em>et al.<\/em> back in 2016 [1]. The main idea of DenseNet is indeed to solve the vanishing gradient problem. The reason it performs better than ResNet is the shortcut paths branching out from a single layer to all subsequent layers. To better illustrate this idea, you can see in Figure 1 below that the input tensor <em>x\u2080<\/em> is forwarded to <em>H\u2081<\/em>, <em>H\u2082<\/em>, <em>H\u2083<\/em>, <em>H\u2084<\/em>, and the <em>transition<\/em> layers. We do the same thing to all layers inside this block, making all tensors <em>densely<\/em> connected, hence the name <em>DenseNet<\/em>. With all these shortcut connections, information can flow seamlessly between layers. Not only that, this mechanism also enables feature reuse, where each layer can directly benefit from the features produced by all preceding layers.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1r3K1uSN4xpLydqdb_D6UAw.png\" alt=\"\" class=\"wp-image-652975\"\/><figcaption class=\"wp-element-caption\">Figure 1. The structure of a single Dense block\u00a0[1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In a standard CNN, if we have <em>L<\/em> layers, we will also have <em>L<\/em> connections. 
Assuming that the above illustration is just a traditional 5-layer CNN, we basically only have the 5 straight arrows coming out of each tensor. In DenseNet, if we have <em>L<\/em> layers, we will have <em>L<\/em>(<em>L<\/em>+1)\/2 connections. So in the above case we get 5(5+1)\/2 = 15 connections in total. You can verify this by manually counting the arrows one by one: 5 coming out of <em>x\u2080<\/em>, 4 out of <em>H\u2081<\/em>, 3 out of <em>H\u2082<\/em>, 2 out of <em>H\u2083<\/em>, and 1 out of <em>H\u2084<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">Another key difference between ResNet and DenseNet is how they combine information from different layers. In ResNet, we combine information from two tensors by element-wise summation, which is mathematically defined in Figure 2 below. Instead of performing element-wise summation, DenseNet combines information by channel-wise concatenation as expressed in Figure 3. With this mechanism, the feature maps produced by all preceding layers are concatenated with the output of the current layer before eventually being used as the input of the subsequent layer.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1kfriLmHtYzPn9cPxq6PLmQ.png\" alt=\"\" class=\"wp-image-652973\"\/><figcaption class=\"wp-element-caption\">Figure 2. The mathematical notation of a residual block in ResNet\u00a0[1].<\/figcaption><\/figure>\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1XXT-oBWIfmen3Urw7rD0Ag.png\" alt=\"\" class=\"wp-image-652974\"\/><figcaption class=\"wp-element-caption\">Figure 3. 
The mathematical notation of the last layer inside a dense block in DenseNet\u00a0[1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Performing channel-wise concatenation like this actually has a side effect: the number of feature maps grows as we get deeper into the network. In the example I showed you in Figure 1, we initially have an input tensor of 6 channels. The <em>H\u2081 <\/em>layer processes this tensor and produces a 4-channel tensor. These two tensors are then concatenated before being forwarded to <em>H\u2082<\/em>. This essentially means that the <em>H\u2082 <\/em>layer accepts 10 channels. Following the same pattern, the <em>H\u2083<\/em>, <em>H\u2084<\/em>, and the <em>transition<\/em> layers will later accept tensors of 14, 18, and 22 channels, respectively. This is actually an example of a DenseNet that uses a <em>growth rate<\/em> of 4, meaning that each layer produces 4 new feature maps. Later on, we will use <em>k<\/em> to denote this parameter as suggested in the original paper.<\/p>\n<p class=\"wp-block-paragraph\">Despite having such complex connections, DenseNet is actually far more efficient than a traditional CNN in terms of the number of parameters. Let\u2019s do a little bit of math to prove this. The structure given in Figure 1 consists of 4 conv layers (let\u2019s ignore the <em>transition<\/em> layer for now). To compute how many parameters a convolution layer has, we can simply calculate <em>input_channels<\/em> \u00d7 <em>kernel_height<\/em> \u00d7 <em>kernel_width<\/em> \u00d7 <em>output_channels<\/em>. 
Assuming that all these convolutions use 3\u00d73 kernels, the layers in the DenseNet architecture would have the following numbers of parameters:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>H\u2081<\/em> \u2192 6\u00d73\u00d73\u00d74 = 216<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2082<\/em> \u2192 10\u00d73\u00d73\u00d74 = 360<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2083<\/em> \u2192 14\u00d73\u00d73\u00d74 = 504<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2084<\/em> \u2192 18\u00d73\u00d73\u00d74 = 648<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">By summing these four numbers, we get 1,728 params in total. Note that this number does not include the bias terms. Now if we try to create the exact same structure with a traditional CNN, we would require the following number of params for each layer:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>H\u2081<\/em> \u2192 6\u00d73\u00d73\u00d710 = 540\u00a0<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2082<\/em> \u2192 10\u00d73\u00d73\u00d714 = 1,260<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2083<\/em> \u2192 14\u00d73\u00d73\u00d718 = 2,268<\/li>\n<li class=\"wp-block-list-item\"><em>H\u2084<\/em> \u2192 18\u00d73\u00d73\u00d722 = 3,564<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Summing these up, a traditional CNN reaches 7,632 params in total, which is over 4\u00d7 more! With this parameter count in mind, we can clearly see that DenseNet is indeed much more lightweight than traditional CNNs. 
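The arithmetic above can be checked with a few lines of Python (a quick sanity check, not part of the model code):

```python
# Parameter count of a conv layer, ignoring biases:
# in_channels * kernel_h * kernel_w * out_channels
def conv_params(in_ch, out_ch, kernel=3):
    return in_ch * kernel * kernel * out_ch

# DenseNet-style block: every layer emits k=4 new feature maps,
# while its input width grows by 4 channels per layer (6, 10, 14, 18).
densenet = sum(conv_params(c, 4) for c in (6, 10, 14, 18))

# Traditional CNN producing the same channel counts directly.
traditional = sum(conv_params(c_in, c_out)
                  for c_in, c_out in ((6, 10), (10, 14), (14, 18), (18, 22)))

print(densenet)     # 1728
print(traditional)  # 7632
```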
The reason DenseNet can be so efficient is the feature reuse mechanism: instead of computing all feature maps from scratch, each layer only computes <em>k<\/em> new feature maps and concatenates them with the existing feature maps from the previous layers.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">Transition Layer<\/h3>\n<p class=\"wp-block-paragraph\">The structure I showed you earlier is actually just the main building block of the DenseNet model, which is called the <em>dense block<\/em>. Figure 4 below shows how these building blocks are assembled, where three of them are connected by the so-called <em>transition layers<\/em>. Each transition layer consists of a convolution followed by a pooling layer. This component has two main responsibilities: first, to reduce the spatial dimension of the tensor, and second, to reduce the number of channels. The reduction in spatial dimension is standard practice when constructing a CNN-based model, where the deeper feature maps should typically have a lower dimension than the shallower ones. Meanwhile, reducing the number of channels is necessary because the channel count would otherwise increase drastically due to the channel-wise concatenation performed inside each layer in the dense block.<\/p>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1gTaDB5pfdcX6F-7cU4gh5Q.png\" alt=\"\" class=\"wp-image-652976\"\/><figcaption class=\"wp-element-caption\">Figure 4. The higher-level view of the DenseNet architecture. 
The convolution-pooling pair is the so-called transition layer\u00a0[1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">To understand how the transition layer reduces channels, we need to look at the <em>compression factor<\/em> parameter. This parameter, which the authors refer to as <em>\u03b8<\/em> (theta), should have a value somewhere between 0 and 1. Suppose we set <em>\u03b8<\/em> to 0.2, then the number of channels forwarded to the next dense block will only be 20% of the total number of channels produced by the current dense block.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">The Complete DenseNet Architecture<\/h3>\n<p class=\"wp-block-paragraph\">As we have understood the <em>dense<\/em> block and the <em>transition<\/em> layer, we can now move on to the complete DenseNet architecture shown in Figure 5 below. It initially accepts an RGB image of size 224\u00d7224, which is then processed by a 7\u00d77 conv and a 3\u00d73 maxpooling layer. Keep in mind that these two layers use a stride of 2, causing the spatial dimension to shrink to 112\u00d7112 and 56\u00d756, respectively. At this point the tensor is ready to be passed through the first dense block, which consists of 6 <em>bottleneck<\/em> blocks (I will talk more about this component very soon). The resulting output will then be forwarded to the first transition layer, followed by the second dense block, and so on until we eventually reach the global average pooling layer. 
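As a quick sanity check on those numbers, the stride-2 downsampling of the stem and the \u03b8-compression of a transition layer can be traced with simple integer arithmetic (a sketch only; the 100-channel input is an assumed example):

```python
# Trace the spatial size through the two stride-2 stem layers,
# then the channel compression applied by one transition layer.
size = 224
for layer in ('7x7 conv, stride 2', '3x3 maxpool, stride 2'):
    size = size // 2  # each stride-2 layer halves the spatial size
    print(f'after {layer}: {size}x{size}')

theta = 0.2      # example compression factor from the text
channels = 100   # assumed channel count out of a dense block
print(int(channels * theta))  # only 20 channels survive the transition
```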
Finally, we pass the tensor to the fully-connected layer, which is responsible for making class predictions.<\/p>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1IjJnHGQpVZUAorEupAjjbg.png\" alt=\"\" class=\"wp-image-652978\"\/><figcaption class=\"wp-element-caption\">Figure 5. The complete DenseNet architecture [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">There are actually several more details I need to explain regarding the architecture above. First, the number of feature maps produced in each step is not explicitly mentioned. This is mainly because the architecture adapts according to the <em>k<\/em> and <em>\u03b8<\/em> parameters. The only layer with a fixed number is the very first convolution layer (the 7\u00d77 one), which produces 64 feature maps (not displayed in the figure). Second, it is also important to note that every convolution layer shown in the architecture follows the <em>BN-ReLU-conv-dropout<\/em> sequence, except for the 7\u00d77 convolution, which does not include the dropout layer. Third, the authors implemented several DenseNet variants, which they refer to as DenseNet (the vanilla one), DenseNet-B (the variant that uses <em>bottleneck<\/em> blocks), DenseNet-C (the one that uses the <em>compression factor \u03b8<\/em>), and DenseNet-BC (the variant that employs both). The architecture given in Figure 5 is the DenseNet-B (or DenseNet-BC) variant.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The so-called <em>bottleneck<\/em> block itself is a stack of 1\u00d71 and 3\u00d73 convolutions. The 1\u00d71 conv is used to reduce the number of channels to 4<em>k<\/em> before they are eventually shrunk further to <em>k<\/em> by the following 3\u00d73 conv. 
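To see why the 1\u00d71 reduction pays off, we can compare rough multiply-accumulate counts per spatial position for producing k feature maps directly versus through the bottleneck (a back-of-the-envelope sketch; the 256-channel input width is an illustrative assumption):

```python
# Rough cost (multiply-accumulates per spatial position) of producing
# k feature maps from a wide input, with and without the 1x1 bottleneck.
k = 12
in_ch = 256  # assumed wide input built up by many previous concatenations

# Direct 3x3 conv: in_ch -> k
direct = in_ch * 3 * 3 * k

# Bottleneck: 1x1 conv (in_ch -> 4k), then 3x3 conv (4k -> k)
bottleneck = in_ch * 1 * 1 * (4 * k) + (4 * k) * 3 * 3 * k

print(direct)      # 27648
print(bottleneck)  # 17472
```

The wider the concatenated input gets, the larger the saving, since the expensive 3\u00d73 kernel only ever sees 4k channels.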
The reason is that a 3\u00d73 convolution is computationally expensive on tensors with many channels. So, to make the computation faster, we need to reduce the channels first using the 1\u00d71 conv. Later in the coding part we are going to implement this DenseNet-BC variant. However, if you want to implement the standard DenseNet (or DenseNet-C) instead, you can simply omit the 1\u00d71 conv so that each dense block only comprises 3\u00d73 convolutions.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">Some Experimental Results<\/h3>\n<p class=\"wp-block-paragraph\">In the paper, the authors conducted several experiments comparing DenseNet with other models. In this section I am going to show you some interesting things they discovered.<\/p>\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1lksfwKO59u9H6MymxfS1Xg.png\" alt=\"\" class=\"wp-image-652977\"\/><figcaption class=\"wp-element-caption\">Figure 6. DenseNet achieves better accuracy than ResNet with fewer parameters and lower computational cost across different network depths [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The first experimental result I found interesting is that DenseNet actually performs much better than ResNet. Figure 6 above shows that it consistently outperforms ResNet across all network depths. When comparing variants with similar accuracy, DenseNet is also much more efficient. Let\u2019s take a closer look at the DenseNet-201 variant. Here you can see that its validation error is nearly the same as that of ResNet-101. 
Despite being 2\u00d7 deeper (201 vs 101 layers), it is roughly 2\u00d7 smaller in terms of both parameters and FLOPs (floating point operations).<\/p>\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/04\/1FBtit6rq9vMF2DPbKitb2A.png\" alt=\"\" class=\"wp-image-652972\"\/><figcaption class=\"wp-element-caption\">Figure 7. How the bottleneck layer and compression factor affect model performance [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Next, the authors also conducted an ablation study regarding the use of the bottleneck layer and the compression factor. We can see in Figure 7 above that employing both the bottleneck layer inside the dense block and the channel count reduction in the transition layer allows the model to achieve higher accuracy (DenseNet-BC). It might seem a bit counterintuitive that reducing the number of channels through the compression factor improves accuracy. In fact, in deep learning, too many features might actually hurt accuracy due to information redundancy. So, reducing the number of channels can be perceived as a regularization mechanism that helps prevent the model from overfitting, allowing it to obtain higher validation accuracy.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">DenseNet From\u00a0Scratch<\/h2>\n<p class=\"wp-block-paragraph\">As we have understood the underlying theory behind DenseNet, we can now implement the architecture from scratch. What we need to do first is import the required modules and initialize the configurable variables. 
In Codeblock 1 below, the <em>k<\/em> and <em>\u03b8<\/em> we discussed earlier are denoted as <code>GROWTH<\/code> and <code>COMPRESSION<\/code>, whose values are set to 12 and 0.5, respectively. These two values are the defaults given in the paper, which we can definitely change if we want to. Next, here I also initialize the <code>REPEATS<\/code> list to store the number of bottleneck blocks inside each dense block.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 1\nimport torch\nimport torch.nn as nn\n\nGROWTH      = 12\nCOMPRESSION = 0.5\nREPEATS     = [6, 12, 24, 16]<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Bottleneck Implementation<\/h3>\n<p class=\"wp-block-paragraph\">Now let\u2019s take a look at the <code>Bottleneck<\/code> class below to see how I implement the stack of 1\u00d71 and 3\u00d73 convolutions. Previously I mentioned that every convolution layer follows the <em>BN-ReLU-Conv-dropout<\/em> structure, so here we need to initialize all these layers in the <code>__init__()<\/code> method.<\/p>\n<p class=\"wp-block-paragraph\">The two convolution layers are initialized as <code>conv0<\/code> and <code>conv1<\/code>, each with their corresponding batch normalization layers. Don&#8217;t forget to set the <code>out_channels<\/code> parameter of the <code>conv0<\/code> layer to <code>GROWTH*4<\/code> because we want it to return 4<em>k<\/em> feature maps (see the line marked with <code>#(1)<\/code>). This number of feature maps will then be shrunk even further to <em>k<\/em> by the <code>conv1<\/code> layer by setting its <code>out_channels<\/code> to <code>GROWTH<\/code> (<code>#(2)<\/code>). As all layers have been initialized, we can now define the flow in the <code>forward()<\/code> method. 
Just keep in mind that at the end of the process we have to concatenate the resulting tensor (<code>out<\/code>) with the original one (<code>x<\/code>) to implement the skip connection (<code>#(3)<\/code>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 2\nclass Bottleneck(nn.Module):\n    def __init__(self, in_channels):\n        super().__init__()\n        \n        self.relu = nn.ReLU()\n        self.dropout = nn.Dropout(p=0.2)\n        \n        self.bn0   = nn.BatchNorm2d(num_features=in_channels)\n        self.conv0 = nn.Conv2d(in_channels=in_channels, \n                               out_channels=GROWTH*4,          #(1) \n                               kernel_size=1, \n                               padding=0, \n                               bias=False)\n        \n        self.bn1   = nn.BatchNorm2d(num_features=GROWTH*4)\n        self.conv1 = nn.Conv2d(in_channels=GROWTH*4, \n                               out_channels=GROWTH,            #(2)\n                               kernel_size=3, \n                               padding=1, \n                               bias=False)\n    \n    def forward(self, x):\n        print(f'original\\t: {x.size()}')\n        \n        out = self.dropout(self.conv0(self.relu(self.bn0(x))))\n        print(f'after conv0\\t: {out.size()}')\n        \n        out = self.dropout(self.conv1(self.relu(self.bn1(out))))\n        print(f'after conv1\\t: {out.size()}')\n        \n        concatenated = torch.cat((out, x), dim=1)              #(3)\n        print(f'after concat\\t: {concatenated.size()}')\n        \n        return concatenated<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In order to check whether our <code>Bottleneck<\/code> class works properly, we will now create one that accepts 64 feature maps and pass a dummy tensor through it. 
The bottleneck layer I instantiate below essentially corresponds to the very first bottleneck in the first dense block (refer back to Figure 5 if you\u2019re not sure). So, to simulate the actual flow of the network, we are going to pass a tensor of size 64\u00d756\u00d756, which is exactly the shape produced by the 3\u00d73 maxpooling layer.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 3\nbottleneck = Bottleneck(in_channels=64)\n\nx = torch.randn(1, 64, 56, 56)\nx = bottleneck(x)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Once the above code is run, we will get the following output on our screen.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\"># Codeblock 3 Output\noriginal     : torch.Size([1, 64, 56, 56])\nafter conv0  : torch.Size([1, 48, 56, 56])    #(1)\nafter conv1  : torch.Size([1, 12, 56, 56])    #(2)\nafter concat : torch.Size([1, 76, 56, 56])<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<p class=\"wp-block-paragraph\">Here we can see that our <code>conv0<\/code> layer successfully reduced the feature maps from 64 to 48 (<code>#(1)<\/code>), where 48 is 4<em>k<\/em> (remember that our <em>k<\/em> is 12). This 48-channel tensor is then processed by the <code>conv1<\/code> layer, which reduces the number of feature maps even further to <em>k<\/em> (<code>#(2)<\/code>). This output tensor is then concatenated with the original one, resulting in a tensor of 64+12 = 76 feature maps. And here is actually where the pattern begins. 
Later in the dense block, if we repeat this bottleneck several times, then each layer will produce:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">second layer \u2192 64+(2\u00d712) = 88 feature maps<\/li>\n<li class=\"wp-block-list-item\">third layer \u2192 64+(3\u00d712) = 100 feature maps<\/li>\n<li class=\"wp-block-list-item\">fourth layer \u2192 64+(4\u00d712) = 112 feature maps<\/li>\n<li class=\"wp-block-list-item\">and so on\u00a0\u2026<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">Dense Block Implementation<\/h3>\n<p class=\"wp-block-paragraph\">Now let\u2019s actually create the <code>DenseBlock<\/code> class to store the sequence of <code>Bottleneck<\/code> instances. Take a look at Codeblock 4 below to see how I do that. The way to do it is quite easy: we can just initialize a module list (<code>#(1)<\/code>) and then append the bottleneck blocks one by one (<code>#(3)<\/code>). Note that we need to keep track of the number of input channels of each bottleneck using the <code>current_in_channels<\/code> variable (<code>#(2)<\/code>). 
Finally, in the <code>forward()<\/code> method we can simply pass the tensor sequentially.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 4\nclass DenseBlock(nn.Module):\n    def __init__(self, in_channels, repeats):\n        super().__init__()\n        \n        self.bottlenecks = nn.ModuleList()    #(1)\n        \n        for i in range(repeats):\n            current_in_channels = in_channels + i*GROWTH    #(2)\n            self.bottlenecks.append(Bottleneck(in_channels=current_in_channels))  #(3)\n        \n    def forward(self, x):\n        for i, bottleneck in enumerate(self.bottlenecks):\n            x = bottleneck(x)\n            print(f'after bottleneck #{i}\\t: {x.size()}')\n        \n        return x<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We can test the code above by simulating the first dense block in the network. You can see in Figure 5 that it contains 6 bottleneck blocks, so in Codeblock 5 below I set the <code>repeats<\/code> parameter to that number (<code>#(1)<\/code>). We can see in the resulting output that the input tensor, which initially has the shape of 64\u00d756\u00d756, is transformed to 136\u00d756\u00d756. 
The 136 feature maps come from 64+(6\u00d712), which follows the pattern I gave you earlier.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 5\ndense_block = DenseBlock(in_channels=64, repeats=6)    #(1)\nx = torch.randn(1, 64, 56, 56)\n\nx = dense_block(x)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\"># Codeblock 5 Output\nafter bottleneck #0 : torch.Size([1, 76, 56, 56])\nafter bottleneck #1 : torch.Size([1, 88, 56, 56])\nafter bottleneck #2 : torch.Size([1, 100, 56, 56])\nafter bottleneck #3 : torch.Size([1, 112, 56, 56])\nafter bottleneck #4 : torch.Size([1, 124, 56, 56])\nafter bottleneck #5 : torch.Size([1, 136, 56, 56])<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">Transition Layer<\/h3>\n<p class=\"wp-block-paragraph\">The next component we are going to implement is the <em>transition<\/em> layer, which is shown in Codeblock 6 below. Similar to the convolution layers in the bottleneck blocks, here we also use the <em>BN-ReLU-conv-dropout<\/em> structure, yet this one comes with an additional average pooling layer at the end (<code>#(1)<\/code>). 
Don&#8217;t forget to set the stride of this pooling layer to 2 to reduce the spatial dimension by half.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 6\nclass Transition(nn.Module):\n    def __init__(self, in_channels, out_channels):\n        super().__init__()\n        \n        self.bn   = nn.BatchNorm2d(num_features=in_channels)\n        self.relu = nn.ReLU()\n        self.conv = nn.Conv2d(in_channels=in_channels, \n                              out_channels=out_channels, \n                              kernel_size=1, \n                              padding=0,\n                              bias=False)\n        self.dropout = nn.Dropout(p=0.2)\n        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)    #(1)\n     \n    def forward(self, x):\n        print(f'original\\t: {x.size()}')\n        \n        out = self.pool(self.dropout(self.conv(self.relu(self.bn(x)))))\n        print(f'after transition: {out.size()}')\n        \n        return out<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now let\u2019s take a look at the testing code in Codeblock 7 below to see how a tensor transforms as it is passed through the above network. In this example I am trying to simulate the very first transition layer, i.e., the one right after the first dense block. This is essentially the reason I set this layer to accept 136 channels. 
Previously I mentioned that this layer is used to shrink the channel dimension through the <em>\u03b8<\/em> parameter, so to implement it we can simply multiply the number of input feature maps by the <code>COMPRESSION<\/code> variable for the <code>out_channels<\/code> parameter.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 7\ntransition = Transition(in_channels=136, out_channels=int(136*COMPRESSION))\n\nx = torch.randn(1, 136, 56, 56)\nx = transition(x)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Once the above code is run, we should obtain the following output. Here you can see that the spatial dimension of the input tensor shrinks from 56\u00d756 to 28\u00d728, while the number of channels also reduces from 136 to 68. This essentially means that our transition layer implementation is correct.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\"># Codeblock 7 Output\noriginal         : torch.Size([1, 136, 56, 56])\nafter transition : torch.Size([1, 68, 28, 28])<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">The Complete DenseNet Architecture<\/h3>\n<p class=\"wp-block-paragraph\">As we have successfully implemented the main components of the DenseNet model, we are now going to assemble the entire architecture. Here I separate the <code>__init__()<\/code> and the <code>forward()<\/code> methods into two codeblocks as they are quite long. 
Just make sure that you put Codeblock 8a and 8b inside the same notebook cell if you want to run it on your own.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8a\nclass DenseNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        \n        self.first_conv = nn.Conv2d(in_channels=3, \n                                    out_channels=64, \n                                    kernel_size=7,    #(1)\n                                    stride=2,         #(2)\n                                    padding=3,        #(3)\n                                    bias=False)\n        self.first_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  #(4)\n        channel_count = 64\n        \n\n        # Dense block #0\n        self.dense_block_0 = DenseBlock(in_channels=channel_count,\n                                        repeats=REPEATS[0])          #(5)\n        channel_count = int(channel_count+REPEATS[0]*GROWTH)         #(6)\n        self.transition_0 = Transition(in_channels=channel_count, \n                                       out_channels=int(channel_count*COMPRESSION))\n        channel_count = int(channel_count*COMPRESSION)               #(7)\n    \n\n        # Dense block #1\n        self.dense_block_1 = DenseBlock(in_channels=channel_count, \n                                        repeats=REPEATS[1])\n        channel_count = int(channel_count+REPEATS[1]*GROWTH)\n        self.transition_1 = Transition(in_channels=channel_count, \n                                       out_channels=int(channel_count*COMPRESSION))\n        channel_count = int(channel_count*COMPRESSION)\n\n        # Dense block #2\n        self.dense_block_2 = DenseBlock(in_channels=channel_count, \n                                        repeats=REPEATS[2])\n        channel_count = int(channel_count+REPEATS[2]*GROWTH)\n        \n        self.transition_2 = Transition(in_channels=channel_count, \n                      
                 out_channels=int(channel_count*COMPRESSION))\n        channel_count = int(channel_count*COMPRESSION)\n\n        # Dense block #3\n        self.dense_block_3 = DenseBlock(in_channels=channel_count, \n                                        repeats=REPEATS[3])\n        channel_count = int(channel_count+REPEATS[3]*GROWTH)\n        \n        \n        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))       #(8)\n        self.fc = nn.Linear(in_features=channel_count, out_features=1000)  #(9)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">What we do first within the <code>__init__()<\/code> technique above is to initialize the <code>first_conv<\/code> and the <code>first_pool<\/code> layers. Take into account that these two layers neither belong to the dense block nor the transition layer, so we have to manually initialize them as <code>nn.Conv2d<\/code> and <code>nn.MaxPool2d<\/code> situations. In truth, these two preliminary layers are fairly distinctive. The convolution layer makes use of a really massive kernel of measurement 7\u00d77 (<code>#(1)<\/code>) with the stride of two (<code>#(2)<\/code>). So, not solely capturing info from massive space, however this layer additionally performs spatial downsampling in-place. Right here we additionally must set the padding to three (<code>#(3)<\/code>) to compensate for the big kernel in order that the spatial dimension doesn\u2019t get decreased an excessive amount of. Subsequent, the pooling layer is totally different from those within the transition layer, the place we use 3\u00d73 maxpooling reasonably than 2\u00d72 common pooling (<code>#(4)<\/code>).<\/p>\n<p class=\"wp-block-paragraph\">As the primary two layers are achieved, what we do subsequent is to initialize the dense blocks and the transition layers. 
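<\/p>\n<p class=\"wp-block-paragraph\">As a quick sanity check for the stem layers described above, we can compute the expected spatial dimensions by hand with the standard convolution output-size formula. The snippet below is just an illustrative sketch; the <code>conv_out<\/code> helper is not part of the original code:<\/p>\n

```python
# Output size of a convolution or pooling layer:
# floor((input + 2*padding - kernel) / stride) + 1
def conv_out(size, kernel, stride, padding):
    return (size + 2 * padding - kernel) // stride + 1

# first_conv: 7x7 kernel, stride 2, padding 3, applied to a 224x224 input
after_conv = conv_out(224, kernel=7, stride=2, padding=3)
print(after_conv)  # 112

# first_pool: 3x3 max pooling, stride 2, padding 1
after_pool = conv_out(after_conv, kernel=3, stride=2, padding=1)
print(after_pool)  # 56
```

<p class=\"wp-block-paragraph\">These match the 112\u00d7112 and 56\u00d756 feature maps we will see later in the Codeblock 9 output. 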
The idea is pretty straightforward: we initialize the dense blocks consisting of multiple bottleneck blocks (where the number of bottlenecks is passed through the <code>repeats<\/code> parameter (<code>#(5)<\/code>)). Remember to keep track of the channel count at every step (<code>#(6,7)<\/code>) so that we can match the input shape of the subsequent layer with the output shape of the previous one. Then we basically do the exact same thing for the remaining dense blocks and the transition layers.<\/p>\n<p class=\"wp-block-paragraph\">Once we have reached the last dense block, we initialize the global average pooling layer (<code>#(8)<\/code>), which is responsible for taking the average value across the spatial dimensions, before eventually initializing the classification head (<code>#(9)<\/code>). Finally, with all layers initialized, we can now connect them all inside the <code>forward()<\/code> method below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 8b\n    def forward(self, x):\n        print(f'original\t\t: {x.size()}')\n        \n        x = self.first_conv(x)\n        print(f'after first_conv\t: {x.size()}')\n        \n        x = self.first_pool(x)\n        print(f'after first_pool\t: {x.size()}')\n        \n        x = self.dense_block_0(x)\n        print(f'after dense_block_0\t: {x.size()}')\n        \n        x = self.transition_0(x)\n        print(f'after transition_0\t: {x.size()}')\n\n        x = self.dense_block_1(x)\n        print(f'after dense_block_1\t: {x.size()}')\n        \n        x = self.transition_1(x)\n        print(f'after transition_1\t: {x.size()}')\n        \n        x = self.dense_block_2(x)\n        print(f'after dense_block_2\t: {x.size()}')\n        \n        x = self.transition_2(x)\n        print(f'after transition_2\t: {x.size()}')\n        \n        x = self.dense_block_3(x)\n        print(f'after dense_block_3\t: {x.size()}')\n        \n        x = self.avgpool(x)\n        print(f'after avgpool\t\t: {x.size()}')\n        \n        x = torch.flatten(x, start_dim=1)\n        print(f'after flatten\t\t: {x.size()}')\n        \n        x = self.fc(x)\n        print(f'after fc\t\t: {x.size()}')\n        \n        return x<\/code><\/pre>\n<p class=\"wp-block-paragraph\">That\u2019s basically the entire implementation of the DenseNet architecture. We can check whether it works properly by running Codeblock 9 below. Here we pass the <code>x<\/code> tensor through the network, which simulates a batch containing a single 224\u00d7224 RGB image.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Codeblock 9\ndensenet = DenseNet()\nx = torch.randn(1, 3, 224, 224)\n\nx = densenet(x)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And below is what the output looks like. Here I deliberately print out the tensor shape after each step so that you can clearly see how the tensor transforms throughout the entire network. Despite having so many layers, this is actually the smallest DenseNet variant, i.e., DenseNet-121. 
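<\/p>\n<p class=\"wp-block-paragraph\">Before walking through the full output, we can verify the channel bookkeeping with plain arithmetic. The sketch below assumes <code>GROWTH = 12<\/code>, <code>COMPRESSION = 0.5<\/code>, and <code>REPEATS = [6, 12, 24, 16]<\/code>, which matches the shapes printed in the Codeblock 9 output:<\/p>\n

```python
# Reproduce the channel_count arithmetic from Codeblock 8a in plain Python.
# Assumed configuration (inferred from the shapes printed in the output):
GROWTH = 12
COMPRESSION = 0.5
REPEATS = [6, 12, 24, 16]

channel_count = 64  # channel count right after first_conv
for i, repeats in enumerate(REPEATS):
    # each bottleneck block adds GROWTH channels
    channel_count = int(channel_count + repeats * GROWTH)
    print(f'after dense_block_{i}: {channel_count}')
    # every dense block except the last is followed by a transition layer
    if i < len(REPEATS) - 1:
        channel_count = int(channel_count * COMPRESSION)
        print(f'after transition_{i}: {channel_count}')
```

<p class=\"wp-block-paragraph\">This prints 136, 68, 212, 106, 394, 197 and finally 389, matching the channel dimensions below. 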
You can actually make the model even larger by changing the values in the <code>REPEATS<\/code> list according to the number of bottleneck blocks inside each dense block given in Figure 5.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\"># Codeblock 9 Output\noriginal             : torch.Size([1, 3, 224, 224])\nafter first_conv     : torch.Size([1, 64, 112, 112])\nafter first_pool     : torch.Size([1, 64, 56, 56])\nafter bottleneck #0  : torch.Size([1, 76, 56, 56])\nafter bottleneck #1  : torch.Size([1, 88, 56, 56])\nafter bottleneck #2  : torch.Size([1, 100, 56, 56])\nafter bottleneck #3  : torch.Size([1, 112, 56, 56])\nafter bottleneck #4  : torch.Size([1, 124, 56, 56])\nafter bottleneck #5  : torch.Size([1, 136, 56, 56])\nafter dense_block_0  : torch.Size([1, 136, 56, 56])\nafter transition_0   : torch.Size([1, 68, 28, 28])\nafter bottleneck #0  : torch.Size([1, 80, 28, 28])\nafter bottleneck #1  : torch.Size([1, 92, 28, 28])\nafter bottleneck #2  : torch.Size([1, 104, 28, 28])\nafter bottleneck #3  : torch.Size([1, 116, 28, 28])\nafter bottleneck #4  : torch.Size([1, 128, 28, 28])\nafter bottleneck #5  : torch.Size([1, 140, 28, 28])\nafter bottleneck #6  : torch.Size([1, 152, 28, 28])\nafter bottleneck #7  : torch.Size([1, 164, 28, 28])\nafter bottleneck #8  : torch.Size([1, 176, 28, 28])\nafter bottleneck #9  : torch.Size([1, 188, 28, 28])\nafter bottleneck #10 : torch.Size([1, 200, 28, 28])\nafter bottleneck #11 : torch.Size([1, 212, 28, 28])\nafter dense_block_1  : torch.Size([1, 212, 28, 28])\nafter transition_1   : torch.Size([1, 106, 14, 14])\nafter bottleneck #0  : torch.Size([1, 118, 14, 14])\nafter bottleneck #1  : torch.Size([1, 130, 14, 14])\nafter bottleneck #2  : torch.Size([1, 142, 14, 14])\nafter bottleneck #3  : torch.Size([1, 154, 14, 14])\nafter bottleneck #4  : torch.Size([1, 166, 14, 14])\nafter bottleneck #5  : torch.Size([1, 178, 14, 14])\nafter bottleneck #6  : torch.Size([1, 190, 14, 14])\nafter bottleneck #7  : torch.Size([1, 202, 14, 14])\nafter bottleneck #8  : torch.Size([1, 214, 14, 14])\nafter bottleneck #9  : torch.Size([1, 226, 14, 14])\nafter bottleneck #10 : torch.Size([1, 238, 14, 14])\nafter bottleneck #11 : torch.Size([1, 250, 14, 14])\nafter bottleneck #12 : torch.Size([1, 262, 14, 14])\nafter bottleneck #13 : torch.Size([1, 274, 14, 14])\nafter bottleneck #14 : torch.Size([1, 286, 14, 14])\nafter bottleneck #15 : torch.Size([1, 298, 14, 14])\nafter bottleneck #16 : torch.Size([1, 310, 14, 14])\nafter bottleneck #17 : torch.Size([1, 322, 14, 14])\nafter bottleneck #18 : torch.Size([1, 334, 14, 14])\nafter bottleneck #19 : torch.Size([1, 346, 14, 14])\nafter bottleneck #20 : torch.Size([1, 358, 14, 14])\nafter bottleneck #21 : torch.Size([1, 370, 14, 14])\nafter bottleneck #22 : torch.Size([1, 382, 14, 14])\nafter bottleneck #23 : torch.Size([1, 394, 14, 14])\nafter dense_block_2  : torch.Size([1, 394, 14, 14])\nafter transition_2   : torch.Size([1, 197, 7, 7])\nafter bottleneck #0  : torch.Size([1, 209, 7, 7])\nafter bottleneck #1  : torch.Size([1, 221, 7, 7])\nafter bottleneck #2  : torch.Size([1, 233, 7, 7])\nafter bottleneck #3  : torch.Size([1, 245, 7, 7])\nafter bottleneck #4  : torch.Size([1, 257, 7, 7])\nafter bottleneck #5  : torch.Size([1, 269, 7, 7])\nafter bottleneck #6  : torch.Size([1, 281, 7, 7])\nafter bottleneck #7  : torch.Size([1, 293, 7, 7])\nafter bottleneck #8  : torch.Size([1, 305, 7, 7])\nafter bottleneck #9  : torch.Size([1, 317, 7, 7])\nafter bottleneck #10 : torch.Size([1, 329, 7, 7])\nafter bottleneck #11 : torch.Size([1, 341, 7, 7])\nafter bottleneck #12 : torch.Size([1, 353, 7, 7])\nafter bottleneck #13 : torch.Size([1, 365, 7, 7])\nafter bottleneck #14 : torch.Size([1, 377, 7, 7])\nafter bottleneck #15 : torch.Size([1, 389, 7, 7])\nafter dense_block_3  : torch.Size([1, 389, 7, 7])\nafter avgpool        : torch.Size([1, 389, 1, 1])\nafter flatten        : torch.Size([1, 389])\nafter fc             : torch.Size([1, 1000])<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">Ending<\/h2>\n<p class=\"wp-block-paragraph\">I believe that\u2019s pretty much everything about the idea and the implementation of the DenseNet model. You can also find all the code above in my GitHub repo [2]. See ya in my next article!\u00a0<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Gao Huang <em>et al.<\/em> Densely Connected Convolutional Networks. arXiv. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1608.06993\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/abs\/1608.06993<\/a> [Accessed September 18, 2025].<\/p>\n<p class=\"wp-block-paragraph\">[2] MuhammadArdiPutra. DenseNet. GitHub. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/DenseNet.ipynb\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/github.com\/MuhammadArdiPutra\/medium_articles\/blob\/main\/DenseNet.ipynb<\/a> [Accessed September 18, 2025].<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>we try to train a very deep neural network model, one challenge that we&#8217;d encounter is the vanishing gradient problem. 
This is essentially a problem where the weight update of a model during training slows down or even stops, hence causing the model to not improve. When a network is very deep, the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":13386,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[2649,8499,424,6776],"class_list":["post-13384","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-connected","tag-densenet","tag-paper","tag-walkthrough"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13384","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13384"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13384\/revisions"}],"predecessor-version":[{"id":13385,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13384\/revisions\/13385"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/13386"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13384"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13384"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13384"}],"curies":[{"name":"wp","href
":"https:\/\/api.w.org\/{rel}","templated":true}]}}