tag:blogger.com,1999:blog-18553711172554757652024-03-13T17:50:59.865+02:00Parallel++A blog about personal interests like computer programming etc.Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.comBlogger41125tag:blogger.com,1999:blog-1855371117255475765.post-40517549402475193332018-02-25T19:45:00.001+02:002018-02-25T19:45:57.723+02:00clmempatterns: Benchmarking GPU memory access strides<div style="text-align: justify;">
Typically, streaming memory loads on GPUs are applied sequentially by the programmer by assigning sequential threads to sequential addresses in order to enforce coalescing. On the other hand, CPUs tend to favor sequentially accessed addresses by each individual thread in order to increase the degree of spatial locality and make better use of cache. Are there any intermediate patterns between these two extreme cases? How would a CPU or GPU device behave under an intermediate situation?</div>
<br />
<div style="text-align: justify;">
Here is where <a href="https://github.com/ekondis/clmempatterns" target="_blank">oclmempatterns</a> benchmark tool comes into play. This benchmark leverages OpenCL to explore the memory performance under different access stride selections. Accessing a memory space is benchmarked by applying all possible access strides that are integer powers of 2, till the amount of total threads is reached. For example, imagine in a simplified scenario that we have to access 16 elements by using a total of 4 threads. For CPUs a good choice is typically using single strided accesses as shown in the figure below.</div>
<br />
<div style="text-align: justify;">
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlEYDHvhUJUNZ4d8SeNtucnn4L9TZ0FSyO_Po-Wn-o-nUROWoiLX5bZZ6ilB6ecoAzCAwEhPunJDqALUI4xSG_6mE5WljSYr4omOdblaZ0LCc9sBx2zIRKwg0APxF6W2hKf1nDedeOAsTI/s1600/clmempatterns-s1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="343" data-original-width="799" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlEYDHvhUJUNZ4d8SeNtucnn4L9TZ0FSyO_Po-Wn-o-nUROWoiLX5bZZ6ilB6ecoAzCAwEhPunJDqALUI4xSG_6mE5WljSYr4omOdblaZ0LCc9sBx2zIRKwg0APxF6W2hKf1nDedeOAsTI/s640/clmempatterns-s1.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Accessing memory with unit strides</td></tr>
</tbody></table>
However, on GPUs a fairly good choice is typically using strides equal to the total amount of threads. This would apply accesses as shown below:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYangK14YD3POvAaTgKYYj0nlD1LawOKfXA-vDIs4RkX2j1JTyQ2jNdDY6FbTlJVfu7kzIlcllYCu1ppfkgKuOB5E3fJtUWvE3YN24DWfwCJecIsq93x1vHWIYoVUG2oLi4-uCSzoSbDK9/s1600/clmempatterns-s4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="343" data-original-width="799" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYangK14YD3POvAaTgKYYj0nlD1LawOKfXA-vDIs4RkX2j1JTyQ2jNdDY6FbTlJVfu7kzIlcllYCu1ppfkgKuOB5E3fJtUWvE3YN24DWfwCJecIsq93x1vHWIYoVUG2oLi4-uCSzoSbDK9/s640/clmempatterns-s4.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">Accessing memory with strides of 4, which equals to the amount of total threads</td></tr>
</tbody></table>
<div style="text-align: justify;">
However, there are many intermediate cases where we can apply various strides. In this simplified example we could apply strides of 2 as shown below:</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzFmCcZlNuKtJOoM5zpwkAkt2o3BGxyqsQWYawnmKg2-tFxfhTXyrjHcelU5oHPA6gI88mifAsnwopbG6caN0yX06JvxPfPsdwGzzU9gK0UqdNHyyOo5e-Y_wQCopNdG2LqHHZV3bQBxf0/s1600/clmempatterns-s2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="343" data-original-width="799" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzFmCcZlNuKtJOoM5zpwkAkt2o3BGxyqsQWYawnmKg2-tFxfhTXyrjHcelU5oHPA6gI88mifAsnwopbG6caN0yX06JvxPfPsdwGzzU9gK0UqdNHyyOo5e-Y_wQCopNdG2LqHHZV3bQBxf0/s640/clmempatterns-s2.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Accessing memory with strides of 2</td></tr>
</tbody></table>
<div style="text-align: justify;">
In a real examples there can be tens of different cases as the memory space can be large, each applying a different power of two stride. If we assume that we have a total amount of N elements quantified as an exact power of 2 then an exact amount of log<sub>2</sub>(N) bits are required to address them. For instance, if 2<sup>26</sup> elements are accessed then 26 bits are required for indexing these elements. Having a smaller amount of threads to process these elements, e.g. 2<sup>20</sup>, would yield a thread index space of 20 bits total. So, by using strides equal of the total thread index space would lead to the following representation whole element space:<br />
<div style="text-align: start;">
<br /></div>
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<div style="text-align: justify;">
</div>
Each cell represents a bit. These 26 bits consist the whole element space. Red bits represent the part of the address that is designated by the thread stride and the green bits are designated by the thread index. This means that each thread uses its thread index to define the green part of the address and thereafter enumerates sequentially each possible value of the red part, applying the memory access for each element address.<br />
<br />
Of course there are other intermediate cases as seen bellow that are tested by the benchmark:<br />
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
<table style="border-spacing: 0px;"><tbody>
<tr>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>25</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>24</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>23</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>22</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>21</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>20</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>19</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>18</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>17</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>16</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>15</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>14</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>13</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>12</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>11</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>10</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>09</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>08</sub></td>
<td style="background-color: springgreen; border: 1px solid grey;">b<sub>07</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>06</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>05</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>04</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>03</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>02</sub></td>
<td style="background-color: lightcoral; border: 1px solid grey;">b<sub>01</sub></td>
</tr>
</tbody></table>
<br />
Each one corresponds to a different shift of the red part in the whole address representation. The last one is the other extreme case typically used on CPUs where each thread accesses elements residing on sequential addresses.<br />
<br />
So, what would the memory access bandwidth would be in all these cases? This is the purpose of <a href="https://github.com/ekondis/clmempatterns" target="_blank">clmempatterns </a>benchmark tool. In the figure below you can see measurements of memory bandwidth by using this tool to access 64M of int elements by using 1M of total threads on a GTX-1060 GPU. As seen using any power of two stride from 32 and beyond leads to good memory bandwidth.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu2LYnk-uyU0f0wKkOZ6paoowKgPe3G7gw1UTWzOZjsr12_QmcYbb9ubnTMo39vcunrm0OOyw-AWm1-8KoE2gkWB3wmi2wmjk_Lf1wbEQQDBg-UNaxgQxfuzlgkdsqq2KR4s7hR9-O2Zsw/s1600/clmempatterns-GTX-1060.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="" border="0" data-original-height="600" data-original-width="800" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu2LYnk-uyU0f0wKkOZ6paoowKgPe3G7gw1UTWzOZjsr12_QmcYbb9ubnTMo39vcunrm0OOyw-AWm1-8KoE2gkWB3wmi2wmjk_Lf1wbEQQDBg-UNaxgQxfuzlgkdsqq2KR4s7hR9-O2Zsw/s640/clmempatterns-GTX-1060.png" title="GTX-1060 benchmark results" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">clmempatterns benchmark execution on GTX-1060 (64M int elements, 1M grid size, 256 workitems/workgroup, granularity: 64 elements/workitem)</td></tr>
</tbody></table>
<br />
<div style="text-align: justify;">
The tool is open source and you may freely experiment with it. I would be glad to let me know about any interesting results you might get.<br />
<br />
URL: <a href="https://github.com/ekondis/clmempatterns">https://github.com/ekondis/clmempatterns</a></div>
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-58369019437578893902017-04-21T12:11:00.001+03:002017-04-21T12:11:03.016+03:00GDC17 AMD Ryzen CPU Optimization slides<a href="http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf" target="_blank">Click here</a> to see the slides (in PDF format) of the presentation on GDC17 regarding optimizing the AMD Ryzen. In their slides, the yet unreleased CodeXL v2.3 is shown. Enjoy!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf" target="_blank"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEsSs8smDyv17rm1z2GqmyycznFfuFDOU5MeUrcWIHHyMPI54xzoj_iF0O5qPgori59iTbPe2YGznhsv5D4qeon66i_SdQqNKMJQALdpCdRDCupWQxOwDcWHViiK8Y6AKPLkH3NVQJK61J/s1600/opt-ryzen.jpg" /></a></div>
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com1tag:blogger.com,1999:blog-1855371117255475765.post-87477494870602676982017-01-22T18:58:00.000+02:002017-01-22T18:58:14.168+02:00Fine grained memory management on AMD Vega GPUs<br />
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDrCH_PpJq1uQlTe2KVkDP0X9nLC1HDaeOo-khFu4iV0-9hhfmoS3WE7jeYMQaBZNDCNCBBnZWbI6-n0jdEo0LhE9xHoI1nZw4f1wNUTijlup45ofHsfy0kJXyVaTzuOV-BXOGkmfi1WBd/s1600/Vega-Final-Presentation-18-840x473.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDrCH_PpJq1uQlTe2KVkDP0X9nLC1HDaeOo-khFu4iV0-9hhfmoS3WE7jeYMQaBZNDCNCBBnZWbI6-n0jdEo0LhE9xHoI1nZw4f1wNUTijlup45ofHsfy0kJXyVaTzuOV-BXOGkmfi1WBd/s400/Vega-Final-Presentation-18-840x473.jpg" width="400" /></a>Recently, some advancements of the upcoming AMD Vega GPU architecture had been disclosed. One of the most interesting features as you might have noticed was the new memory architecture. The upcoming Vega GPUs are reportedly having support for fine grained memory management and employ fast HBM2 memories. In addition, the GPU employs a mechanism which AMD calls <i>high bandwidth cache controller</i> and the GPU memory is referred to as <i>high bandwidth cache</i>. In this sense they claim that it is used as a cache memory, that is a secondary level of cache by having the main system memory serving as the last level of the hierarchy. This fact makes me think that the Vega architecture might support the same fine grained shared virtual memory infrastructure that the NVidia Pascal already implements on its GP100 GPU but that is not clear yet. No reference to shared virtual memory was made so this is just a personal speculation. Only the fact that the GPU would be able to utilize the system memory does not imply that the GPU will utilize the same virtual addressing mechanism as the CPU does. However, it is reasonable to think this will be supported since AMD wants to grab a significant portion of the HPC market through its <a href="https://radeonopencompute.github.io/install.html" target="_blank">ROCm </a>platform.<br />
<br />
If this is the case then it will prove to be an exciting feature with the benefits I had referred to in <a href="http://parallelplusplus.blogspot.gr/2016/09/nvidia-pascals-gpu-architecture-most.html" target="_blank">a previous post of mine</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq7jjeTNnAGTSpoLzqohIUAQLMyONTNpDvbFfbtURYbi9tYT7FKLmt14o5D36zcDxb9OVr41RYYDnAdDN3HEwegCS0xSXj8Tx46IhXPGm3FldhCVFdSqAHtCxsG0OphBRlgzCbOOgFj_1_/s1600/slides-16.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq7jjeTNnAGTSpoLzqohIUAQLMyONTNpDvbFfbtURYbi9tYT7FKLmt14o5D36zcDxb9OVr41RYYDnAdDN3HEwegCS0xSXj8Tx46IhXPGm3FldhCVFdSqAHtCxsG0OphBRlgzCbOOgFj_1_/s400/slides-16.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCT-f8y6k_FRYGJbl0oeTv4W0co3jqIzXthvakeQwSvqakSi5PofCw47UXGdu_lmSczVaijl7euq-k5bIhsEWbt_kXrBWggqHjis57mfXohS_qUxsTAvFDmwTgdwd_8ZSTIjBiGrm7gxuo/s1600/slides-13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCT-f8y6k_FRYGJbl0oeTv4W0co3jqIzXthvakeQwSvqakSi5PofCw47UXGdu_lmSczVaijl7euq-k5bIhsEWbt_kXrBWggqHjis57mfXohS_qUxsTAvFDmwTgdwd_8ZSTIjBiGrm7gxuo/s400/slides-13.jpg" width="400" /></a></div>
<br />
<br />
<br />
<a href="http://www.anandtech.com/show/11002/the-amd-vega-gpu-architecture-teaser/2" target="_blank">Source: Anandtech</a></div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-39086720210748555192016-12-26T08:49:00.001+02:002016-12-26T08:49:19.253+02:00OpenCL/ROCm clinfo output on AMD Fiji<div style="text-align: justify;">
This month with the release of AMD ROCm v1.4 we also had a taste of the preview version of the OpenCL runtime on ROCm. For anyone curious about it here is the clinfo output on an AMD R9-Nano GPU (<a href="https://gist.github.com/ekondis/0a96eeecdae5610da3dd5c861b433eae" style="text-align: start;" target="_blank">external URL on gist</a>):</div>
<br />
<pre class="prettyprint lang-html">Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2300.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Fiji [Radeon R9 FURY / NANO Series]
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1000Mhz
Address bits: 64
Max memory allocation: 3221225472
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 29440
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 4294967296
Constant buffer size: 3221225472
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 3221225472
Max global variable preferred total size: 4294967296
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f7273868198
Name: gfx803
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 1.1 (HSA,LC)
Profile: FULL_PROFILE
Version: OpenCL 1.2
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images
</pre>
<br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com1tag:blogger.com,1999:blog-1855371117255475765.post-30715375285296039962016-09-18T17:51:00.000+03:002016-09-18T17:51:58.370+03:00NVidia Pascal's GPU architecture most exciting feature<div dir="ltr" style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/pascal_8_unified_memory-300x213.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="141" src="https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/pascal_8_unified_memory-300x213.png" width="200" /></a></div>
Few months ago NVidia announced the Pascal GPU architecture and more specifically the GP100 GPU. This is a monstrous GPU with more than 15 billion transistors built using a 16nm FinFET fabrication. Though, the alleged performance numbers are arguably impressive (10.6 TFlops SP, 5.3 TFlops DP) I personally think that this is not the most impressive feature of this GPU.<br />
<br /></div>
<div dir="ltr" style="text-align: justify;">
The most impressive feature I found on as advertised is the unified memory support. In CUDA 6 and CC3.0 & CC3.5 devices (Kepler architecture) this term had been first introduced. But it didn't actually provide any real benefits at the time other than programming laziness. In particular, the run-time took care of moving the whole data to/from the GPU memory whenever it was used on either the host or GPU. The GP100 memory unification seems far more complete as according to specifications it seems to take memory unification to the next level. It supports data migration at the granularity of memory page! This means that programmer is able to "see" the whole system memory and the run-time takes care of which memory page should be moved at the time it is actually needed. This is a great feature! It allows porting CPU programs to CUDA without caring which data will actually be accessed.<br />
<br />
For instance, imagine having a huge tree or graph structure and and you have a GPU kernel that needs to access just a few nodes on it without knowing which beforehand. Using the Kepler memory unification feature would require copying the whole structure from the host to GPU memory which could potentially cannibalize performance. The Pascal memory unification would actually copy only the memory pages residing on the accessed nodes, instead. This releases programmer from a great pain and that's why I think this is the most exciting feature.<br />
<br /></div>
<div dir="ltr" style="text-align: justify;">
I really hope this feature will be eventually supported on consumer GPU variants and stays not just an HPC feature for in Tesla products. I also hope that AMD will also support such a feature in its emerging ROCm platform.</div>
<div dir="ltr" style="text-align: justify;">
<br /></div>
<div dir="ltr">
Resources:</div>
<div dir="ltr">
<a href="https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf" target="_blank">NVidia Tesla P100 whitepaper</a></div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-88051525698798788282016-05-19T11:42:00.001+03:002016-05-19T11:42:53.044+03:00mixbench on an AMD Fiji GPU<div style="text-align: justify;">
Recently, I had the quite pleasant opportunity to be granted with the Radeon R9 Nano GPU card. This card features the Fiji GPU and as such it seems to be a compute beast as it features 4096 shader units and HBM memory with bandwidth reaching to 512GB/sec. If one considers the card's remarkably small size and low power consumption, this card proves to be a great and efficient compute device for handling parallel compute tasks via OpenCL (or <a href="https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP" target="_blank">HIP</a>, but more on this on a later post).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_RG2iL2JVts_ni6qkJgnUPLx69PmICRnpb_Ip8vaGIHnCTPk5pgtGnmJDL4Wqhak4T2VO6JiU5Uca-VmdXiOA-w1Un6vYnJzMur6pR3069FZWgl0hzzmjeb3lwpa_bNGDPSXw0PRXpqcB/s1600/nano-donation.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="287" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_RG2iL2JVts_ni6qkJgnUPLx69PmICRnpb_Ip8vaGIHnCTPk5pgtGnmJDL4Wqhak4T2VO6JiU5Uca-VmdXiOA-w1Un6vYnJzMur6pR3069FZWgl0hzzmjeb3lwpa_bNGDPSXw0PRXpqcB/s400/nano-donation.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">AMD R9 Nano GPU card</td></tr>
</tbody></table>
<br />
One of the first experiments I tried on it was the <a href="http://parallelplusplus.blogspot.gr/2015/07/mixbench-gpu-performance-benchmark-for.html" target="_blank">mixbench</a> microbenchmark tool, of course. Expressing the execution results via gnuplot in the memory bandwidth/compute throughput plane is depicted here:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYckLEPe4vQqfOsudYol0_fVgROYLiH_nULvwb9ONaHmB4UlNyzbHbOEUvm1W5IfcCOpBJtzVbDjn6mDParBrreUaX4Ma6otk8wzv11qz4dNQ4mt3iyaJceBcETnHh4nsjAPcpzewndUn/s1600/4e9bf412-92cd-4997-9425-87197c25c203.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYckLEPe4vQqfOsudYol0_fVgROYLiH_nULvwb9ONaHmB4UlNyzbHbOEUvm1W5IfcCOpBJtzVbDjn6mDParBrreUaX4Ma6otk8wzv11qz4dNQ4mt3iyaJceBcETnHh4nsjAPcpzewndUn/s640/4e9bf412-92cd-4997-9425-87197c25c203.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">mixbench-ocl-ro as executed on the R9 Nano</td></tr>
</tbody></table>
GPU performance effectively approaches 8 TeraFlops of single precision compute performance on heavily compute intensive kernels whereas it exceeds 450GB/sec memory bandwidth on memory oriented kernels.<br />
<br /></div>
<div style="text-align: justify;">
For anyone interested in trying mixbench on their CUDA/OpenCL/HIP GPU please follow the link to github:</div>
<div style="text-align: justify;">
<a href="https://github.com/ekondis/mixbench">https://github.com/ekondis/mixbench</a><br />
<br />
Here is an example of execution on Ubuntu Linux:</div>
<div style="text-align: justify;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/gz4Iuuos4VE/0.jpg" frameborder="0" height="426" src="https://www.youtube.com/embed/gz4Iuuos4VE?feature=player_embedded" width="512"></iframe></div>
<br />
<br /></div>
<div style="text-align: justify;">
<u>Acknowledgement</u>: I would like to greatly thank the Radeon Open Compute department of AMD for kindly supplying the Radeon R9 Nano GPU card for the support of our research.</div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-54609931189827751652016-03-19T19:31:00.000+02:002016-04-17T13:07:10.806+03:00Raspberry PI 3 is here!<div style="text-align: justify;">
Some days ago the Raspberry PI 3 arrived home as I had ordered one when I heard of its launch. It's certainly a faster PI than the PI 2 due to the ARM Cortex-A53 cores. More or less the +50% performance ratio is true, depending on the application of course. There are some other additions as well like WiFi and bluetooth.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgApCh93qns48aEcKP404kgkCBKj-CJjqg9bepYAUtdHoTOb0wd4_0n0_OARCTTQ5galV2KAtdzOzmOWoFQIoDo5T_FbMBJ4fplcEhJqOE-v6zlZLBOWB-hfSf97oWWo5t82T2BOILC0r5a/s1600/raspi3-1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgApCh93qns48aEcKP404kgkCBKj-CJjqg9bepYAUtdHoTOb0wd4_0n0_OARCTTQ5galV2KAtdzOzmOWoFQIoDo5T_FbMBJ4fplcEhJqOE-v6zlZLBOWB-hfSf97oWWo5t82T2BOILC0r5a/s400/raspi3-1.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Raspberry PI 3</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8MF2eTdrfYmaQs-lTdsKIAdga6gOWXB8aB1qGEIfisK1LLrIT1vPZSDQ0Es2Wqsr_gUyrQwp7OOCvvrO9fY0tDh_Wt5H_hCkASTyF-UhJMuMVBQjXYriHK8d0519QizE02ptjGb6_HVl0/s1600/raspi3-2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8MF2eTdrfYmaQs-lTdsKIAdga6gOWXB8aB1qGEIfisK1LLrIT1vPZSDQ0Es2Wqsr_gUyrQwp7OOCvvrO9fY0tDh_Wt5H_hCkASTyF-UhJMuMVBQjXYriHK8d0519QizE02ptjGb6_HVl0/s400/raspi3-2.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A closer look of the PI 3</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: justify;">
As usual, I am providing some <i>nbench</i> execution results. These are consistent with the <complete id="goog_546913451">+50% performance claim. For those interested I had published <a href="http://parallelplusplus.blogspot.gr/2015/02/raspberry-pi-2-is-here.html" target="_blank">nbench results on the PI 2</a> in the past.</complete></div>
<br />
<pre class="prettyprint lang-html">BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 654.04 : 16.77 : 5.51
STRING SORT : 72.459 : 32.38 : 5.01
BITFIELD : 1.9972e+08 : 34.26 : 7.16
FP EMULATION : 134.28 : 64.44 : 14.87
FOURIER : 6677.3 : 7.59 : 4.27
ASSIGNMENT : 10.381 : 39.50 : 10.25
IDEA : 2740.7 : 41.92 : 12.45
HUFFMAN : 1008.9 : 27.98 : 8.93
NEURAL NET : 9.8057 : 15.75 : 6.63
LU DECOMPOSITION : 365.38 : 18.93 : 13.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 34.272
FLOATING-POINT INDEX: 13.131
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : 4 CPU ARMv7 Processor rev 4 (v7l)
L2 Cache :
OS : Linux 4.1.18-v7+
C compiler : gcc-4.9
libc : libc-2.19.so
MEMORY INDEX : 7.162
INTEGER INDEX : 9.769
FLOATING-POINT INDEX: 7.283
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
</pre>
<br />
<div style="text-align: justify;">
As I crossed <a href="http://www.phoronix.com/scan.php?page=news_item&px=RPi2-vs-RPi3-SoC-Temperature" target="_blank">some reports on temperature issues of PI 3</a> I wanted to execute some experiments on power consumption of the PI 3. I used a power meter on which I plugged the power supply unit feeding the PI. I run a few experiments and I got the following power consumption ratings:</div>
<br />
<style>
table.brd, .brd td{
border:1px solid;
border-spacing: 0px;
border-collapse: collapse;
}
</style>
<br />
<table class="brd">
<tbody>
<tr><th>PI running state</th><th>Power consumption</th></tr>
<tr><td>Idle</td><td>1.4W</td></tr>
<tr><td>Single threaded benchmark</td><td>2.2W</td></tr>
<tr><td>Multithreaded benchmark</td><td>4.0W</td></tr>
<tr><td>After running "poweroff"</td><td>0.5W</td></tr>
</tbody></table>
<br />
So, for my case it doesn't seem consume to much power. However, a comparison with the PI 2 should be performed in order to have a better picture.Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-88262766645597094912015-11-22T14:10:00.001+02:002015-11-22T14:10:38.646+02:00mixbench benchmark OpenCL implementation<div style="text-align: justify;">
Four and a half months ago I posted an article about <a href="http://parallelplusplus.blogspot.gr/2015/07/mixbench-gpu-performance-benchmark-for.html" target="_blank">mixbench</a> benchmark. This benchmark was used to assess performance of an artificial kernel with mixed compute and memory operations which corresponds to various operational intensities (Flops/byte ratios). The implementation was based on CUDA and therefore only NVidia GPUs could be used.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now, I've ported the CUDA implementation to OpenCL and here I provide some performance numbers on an AMD R7-260X. Here is the output when using 128MB memory buffer:</div>
<div style="text-align: justify;">
<br /></div>
<div>
<pre class="prettyprint lang-html">mixbench-ocl (compute & memory balancing GPU microbenchmark)
Use "-h" argument to see available options
------------------------ Device specifications ------------------------
Device: Bonaire
Driver version: 1800.11 (VM)
GPU clock rate: 1175 MHz
Total global mem: 1871 MB
Max allowed buffer: 1336 MB
OpenCL version: OpenCL 2.0 AMD-APP (1800.11)
Total CUs: 14
-----------------------------------------------------------------------
Buffer size: 128MB
Workgroup size: 256
Workitem stride: NDRange
Loading kernel source file...
Precompilation of kernels... [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]
--------------------------------------------------- CSV data --------------------------------------------------
Single Precision ops,,,, Double precision ops,,,, Integer operations,,,
Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Iops/byte, ex.time, GIOPS, GB/sec
0.000, 273.95, 0.00, 62.71, 0.000, 519.39, 0.00, 66.15, 0.000, 258.30, 0.00, 66.51
0.065, 252.12, 4.26, 66.01, 0.032, 506.86, 2.12, 65.67, 0.065, 252.08, 4.26, 66.02
0.133, 241.49, 8.89, 66.69, 0.067, 487.11, 4.41, 66.13, 0.133, 241.59, 8.89, 66.67
0.207, 235.72, 13.67, 66.05, 0.103, 474.25, 6.79, 65.66, 0.207, 236.35, 13.63, 65.87
0.286, 225.46, 19.05, 66.67, 0.143, 453.92, 9.46, 66.23, 0.286, 225.05, 19.08, 66.80
0.370, 219.59, 24.45, 66.01, 0.185, 442.80, 12.12, 65.47, 0.370, 220.15, 24.39, 65.84
0.462, 209.03, 30.82, 66.78, 0.231, 421.14, 15.30, 66.29, 0.462, 209.10, 30.81, 66.76
0.560, 203.60, 36.92, 65.92, 0.280, 409.07, 18.37, 65.62, 0.560, 203.99, 36.85, 65.80
0.667, 192.80, 44.55, 66.83, 0.333, 388.95, 22.09, 66.26, 0.667, 193.27, 44.44, 66.67
0.783, 187.81, 51.46, 65.75, 0.391, 378.34, 25.54, 65.27, 0.783, 187.86, 51.44, 65.73
0.909, 177.09, 60.63, 66.70, 0.455, 357.29, 30.05, 66.12, 0.909, 177.18, 60.60, 66.66
1.048, 171.62, 68.82, 65.69, 0.524, 345.04, 34.23, 65.35, 1.048, 171.59, 68.83, 65.70
1.200, 160.76, 80.15, 66.79, 0.600, 325.75, 39.55, 65.92, 1.200, 160.57, 80.24, 66.87
1.368, 155.33, 89.86, 65.67, 0.684, 313.23, 44.56, 65.13, 1.368, 155.30, 89.88, 65.68
1.556, 144.48, 104.05, 66.89, 0.778, 293.56, 51.21, 65.84, 1.556, 144.62, 103.95, 66.82
1.765, 139.33, 115.60, 65.51, 0.882, 281.60, 57.20, 64.82, 1.765, 139.33, 115.60, 65.50
2.000, 128.79, 133.40, 66.70, 1.000, 261.47, 65.70, 65.70, 2.000, 128.86, 133.32, 66.66
2.267, 117.57, 155.26, 68.50, 1.133, 235.53, 77.50, 68.38, 2.267, 117.49, 155.36, 68.54
2.571, 112.96, 171.10, 66.54, 1.286, 246.34, 78.46, 61.02, 2.571, 112.65, 171.57, 66.72
2.923, 101.62, 200.77, 68.68, 1.462, 257.16, 79.33, 54.28, 2.923, 101.13, 201.72, 69.01
3.333, 96.64, 222.22, 66.67, 1.667, 268.00, 80.13, 48.08, 3.333, 95.65, 224.51, 67.35
3.818, 83.93, 268.65, 70.36, 1.909, 278.84, 80.86, 42.36, 3.818, 72.92, 309.24, 80.99
4.400, 80.58, 293.16, 66.63, 2.200, 289.68, 81.55, 37.07, 4.400, 73.59, 321.00, 72.95
5.111, 67.67, 364.96, 71.41, 2.556, 300.58, 82.16, 32.15, 5.111, 74.28, 332.49, 65.05
6.000, 64.45, 399.83, 66.64, 3.000, 311.43, 82.75, 27.58, 6.000, 75.29, 342.26, 57.04
7.143, 50.01, 536.76, 75.15, 3.571, 322.26, 83.30, 23.32, 7.143, 76.25, 352.04, 49.29
8.667, 48.34, 577.52, 66.64, 4.333, 333.09, 83.81, 19.34, 8.667, 77.26, 361.33, 41.69
10.800, 33.47, 866.12, 80.20, 5.400, 343.93, 84.29, 15.61, 10.800, 78.25, 370.48, 34.30
14.000, 32.22, 932.99, 66.64, 7.000, 354.77, 84.74, 12.11, 14.000, 79.26, 379.32, 27.09
19.333, 20.68, 1505.69, 77.88, 9.667, 376.91, 82.62, 8.55, 19.333, 80.27, 387.93, 20.07
30.000, 19.37, 1663.32, 55.44, 15.000, 378.17, 85.18, 5.68, 30.000, 81.26, 396.41, 13.21
62.000, 18.46, 1802.66, 29.08, 31.000, 389.93, 85.36, 2.75, 62.000, 33.57, 991.64, 15.99
inf, 16.68, 2059.77, 0.00, inf, 397.94, 86.34, 0.00, inf, 33.54, 1024.43, 0.00
---------------------------------------------------------------------------------------------------------------
</pre>
<br /></div>
<div>
<div style="text-align: justify;">
And here is "memory bandwidth" to "compute throughput" plot on the single precision floating point experiment results:</div>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixsZz0rVeUk4N0JBqGQkFPEd4GDBugW4J2UvqV-xhFmHkqFgJpGD1Ga4dKRbc-vy-IL6UJ_do4r5fGzY1MtYSKhfAM_ehePv90YQsm7MEcNCxiipqCW2nyccmv-6pzKovNyP-2j-oFodlj/s1600/test2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixsZz0rVeUk4N0JBqGQkFPEd4GDBugW4J2UvqV-xhFmHkqFgJpGD1Ga4dKRbc-vy-IL6UJ_do4r5fGzY1MtYSKhfAM_ehePv90YQsm7MEcNCxiipqCW2nyccmv-6pzKovNyP-2j-oFodlj/s640/test2.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
The source code of mixbench is freely provided, hosted at a github repository and you can find it at <a href="https://github.com/ekondis/mixbench" target="_blank">https://github.com/ekondis/mixbench</a>. I would be happy to include results from other GPUs as well. Please try this tool and let me know about your extracted results and thoughts.</div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-68296527223674856612015-11-16T15:28:00.000+02:002015-11-16T15:28:23.233+02:00OpenCL 2.1 and SPIR-V standards released!I've just noticed that the OpenCL 2.1 and SPIR-V standards were released today!<br />
<br />
I just hope that vendors will not take to long to introduce up to date SDKs and drivers.<br />
<br />
<a href="https://www.khronos.org/registry/cl/" target="_blank">OpenCL 2.1</a><br />
<a href="https://www.khronos.org/registry/spir-v/" target="_blank">SPIR-V</a>Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-69705572116977363712015-10-28T18:18:00.000+02:002015-10-28T18:18:25.453+02:00OpenCL on the Raspberry PI 2<div style="text-align: justify;">
OpenCL can be enabled on the Raspberry PI 2! However, you'll be disappointed to know that I'm referring to the utilization of its CPU, not GPU. Nevertheless, running OpenCL on the PI could be useful for development and experimentation on an embedded platform.</div>
<br />
<div style="text-align: justify;">
You'll need the <a href="http://portablecl.org/" target="_blank">POCL implementation (Portable OpenCL)</a> which relies on the <a href="http://llvm.org/" target="_blank">LLVM</a>. I used the just released v0.12 of POCL and the Raspbian Jessie supplied LLVM v.3.5.</div>
<br />
<div style="text-align: justify;">
After compiling and installing POCL with the natural procedure (you might need to install some libraries from the raspbian repositories, e.g. libhwloc-dev, libclang-dev or mesa-common-dev) you'll be able to compile OpenCL programs on the PI. I tested the <a href="https://github.com/krrishnarraj/clpeak" target="_blank">clpeak benchmark program</a> but the compute results were rather poor:</div>
<br />
<pre class="prettyprint lang-html">Platform: Portable Computing Language
Device: pthread
Driver version : 0.12-pre (Linux ARM)
Compute units : 4
Clock frequency : 900 MHz
Global memory bandwidth (GBPS)
float : 0.85
float2 : 0.87
float4 : 0.76
float8 : 0.75
float16 : 0.81
Single-precision compute (GFLOPS)
float : 0.03
float2 : 0.03
float4 : 0.03
float8 : 0.03
float16 : 0.03
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 0.79
enqueueReadBuffer : 0.69
enqueueMapBuffer(for read) : 12427.57
memcpy from mapped ptr : 0.69
enqueueUnmap(after write) : 18970.70
memcpy to mapped ptr : 0.70
Kernel launch latency : 190270.91 us
</pre>
<br />
<div style="text-align: justify;">
In addition, the integer benchmark could not be executed for some reason. However, memory bandwidth result was decent and using a personal benchmark tool I could measure more than 1.4GB/sec memory bandwidth which is really nice for a PI!</div>
<br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-30787184666819064712015-07-04T13:08:00.001+03:002015-11-22T14:10:28.907+02:00mixbench: A GPU benchmark for mixed compute/transfer bound kernels<div style="text-align: justify;">
I have just released <a href="https://github.com/ekondis/mixbench" target="_blank">mixbench on github</a>. It is a benchmark tool which assesses performance bounds on GPUs (compute or memory bound) under mixed workloads. Unfortunately, it's currently implemented on CUDA so only NVidia GPUs can be used. The compute part can be SP Flops, DP Flops or Int ops and the memory part is global memory traffic. Running the multiple experiments in a wide range of operational intensity values allows to examine the performance of GPUs under different kernel characteristics.</div>
<br />
Running the program under a GTX-480 gives the following output:<br />
<br />
<div class="separator" style="clear: both;">
</div>
<pre class="prettyprint lang-html">mixbench (compute & memory balancing GPU microbenchmark)
------------------------ Device specifications ------------------------
Device: GeForce GTX 480
CUDA driver version: 5.50
GPU clock rate: 1401 MHz
Memory clock rate: 924 MHz
Memory bus width: 384 bits
WarpSize: 32
L2 cache size: 768 KB
Total global mem: 1535 MB
ECC enabled: No
Compute Capability: 2.0
Total SPs: 480 (15 MPs x 32 SPs/MP)
Compute throughput: 1344.96 GFlops (theoretical single precision FMAs)
Memory bandwidth: 177.41 GB/sec
-----------------------------------------------------------------------
Total GPU memory 1610285056, free 1195106304
Buffer size: 256MB
Trade-off type:compute with global memory (block strided)
---- EXCEL data ----
Operations ratio ; Single Precision ops ;;; Double precision ops ;;; Integer operations
compute/memory ; Time; GFLOPS; GB/sec; Time; GFLOPS; GB/sec; Time; GIOPS; GB/sec
0/32 ; 240.531; 0.00; 142.85; 475.150; 0.00; 144.63; 240.205; 0.00; 143.04
1/31 ; 233.548; 9.20; 142.52; 460.193; 4.67; 144.66; 233.484; 9.20; 142.56
2/30 ; 225.249; 19.07; 143.01; 445.144; 9.65; 144.73; 225.235; 19.07; 143.02
3/29 ; 218.552; 29.48; 142.48; 430.575; 14.96; 144.64; 218.745; 29.45; 142.35
4/28 ; 210.345; 40.84; 142.93; 415.425; 20.68; 144.74; 210.091; 40.89; 143.10
5/27 ; 203.132; 52.86; 142.72; 400.472; 26.81; 144.78; 203.275; 52.82; 142.62
6/26 ; 194.468; 66.26; 143.56; 385.434; 33.43; 144.86; 194.314; 66.31; 143.67
7/25 ; 187.470; 80.19; 143.19; 370.915; 40.53; 144.74; 187.475; 80.18; 143.18
8/24 ; 175.115; 98.11; 147.16; 355.723; 48.30; 144.89; 175.132; 98.10; 147.14
9/23 ; 171.760; 112.53; 143.78; 341.353; 56.62; 144.70; 171.920; 112.42; 143.65
10/22 ; 163.397; 131.43; 144.57; 326.007; 65.87; 144.92; 163.252; 131.54; 144.70
11/21 ; 155.797; 151.62; 144.73; 311.655; 75.80; 144.70; 155.814; 151.61; 144.71
12/20 ; 146.573; 175.82; 146.51; 296.386; 86.95; 144.91; 146.662; 175.71; 146.42
13/19 ; 138.853; 201.06; 146.93; 281.757; 99.08; 144.81; 138.941; 200.93; 146.83
14/18 ; 129.727; 231.75; 148.98; 266.401; 112.86; 145.10; 129.744; 231.72; 148.97
15/17 ; 121.228; 265.72; 150.57; 251.283; 128.19; 145.28; 121.339; 265.47; 150.43
16/16 ; 120.065; 286.18; 143.09; 235.740; 145.75; 145.75; 120.122; 286.04; 143.02
17/15 ; 111.357; 327.84; 144.64; 219.472; 166.34; 146.77; 111.528; 327.34; 144.41
18/14 ; 106.430; 363.19; 141.24; 231.498; 166.98; 129.87; 106.541; 362.82; 141.10
19/13 ; 96.118; 424.50; 145.22; 243.534; 167.54; 114.63; 96.494; 422.85; 144.66
20/12 ; 89.602; 479.34; 143.80; 256.247; 167.61; 100.57; 89.642; 479.13; 143.74
21/11 ; 81.976; 550.13; 144.08; 269.055; 167.61; 87.80; 83.091; 542.74; 142.15
22/10 ; 76.066; 621.10; 141.16; 282.898; 167.00; 75.91; 76.068; 621.08; 141.15
23/ 9 ; 65.631; 752.57; 147.24; 295.743; 167.01; 65.35; 76.895; 642.33; 125.67
24/ 8 ; 60.809; 847.57; 141.26; 307.479; 167.62; 55.87; 80.099; 643.45; 107.24
25/ 7 ; 52.032; 1031.82; 144.45; 321.449; 167.02; 46.76; 83.296; 644.53; 90.23
26/ 6 ; 48.321; 1155.49; 133.33; 334.305; 167.02; 38.54; 86.519; 645.35; 74.46
27/ 5 ; 49.519; 1170.90; 108.42; 347.157; 167.02; 30.93; 89.729; 646.19; 59.83
28/ 4 ; 50.704; 1185.90; 84.71; 360.013; 167.02; 23.86; 92.891; 647.31; 46.24
29/ 3 ; 52.024; 1197.09; 61.92; 372.867; 167.02; 17.28; 96.115; 647.94; 33.51
30/ 2 ; 53.377; 1206.97; 40.23; 385.722; 167.02; 11.13; 99.328; 648.61; 21.62
31/ 1 ; 53.437; 1245.80; 20.09; 397.203; 167.60; 5.41; 101.247; 657.52; 10.61
32/ 0 ; 53.558; 1283.08; 0.00; 410.012; 167.60; 0.00; 102.494; 670.47; 0.00
--------------------
</pre>
<div>
<br /></div>
The results for single and double precision Flops are illustrated in the following charts:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP0mEZiHNvEMahqAUDvIsnQEv-2l950BaKbyvD6NRIC5iEVfHJdMIGgXKGQn4iv0ZqTgCxKYISunyOcB9JX92j7OH2P2T04ZcbNkgbAbs3Ca41UZ3LKGW_cfMrzvxzESTg2pIojgBkcvFw/s1600/mixbench-perf-sp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP0mEZiHNvEMahqAUDvIsnQEv-2l950BaKbyvD6NRIC5iEVfHJdMIGgXKGQn4iv0ZqTgCxKYISunyOcB9JX92j7OH2P2T04ZcbNkgbAbs3Ca41UZ3LKGW_cfMrzvxzESTg2pIojgBkcvFw/s400/mixbench-perf-sp.png" width="400" /></a></td></tr>
<tr><td class="tr-caption">% of peak SP Flops and memory bandwidth performance related with the operational intensity</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEC9wGQeHsEStlo_vgNH6sDg9SHB7J5INCRDqpyxKBcmp-ExzPayi_SoeJ4_VW9pA71ZY6J-oSWpAAKtcVnSCXmt5kgX0F8qVjBEky1htZkldnau8psvCYlBxAwuT2bwxSDbWiR5VyIIbT/s1600/mixbench-perf-dp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEC9wGQeHsEStlo_vgNH6sDg9SHB7J5INCRDqpyxKBcmp-ExzPayi_SoeJ4_VW9pA71ZY6J-oSWpAAKtcVnSCXmt5kgX0F8qVjBEky1htZkldnau8psvCYlBxAwuT2bwxSDbWiR5VyIIbT/s400/mixbench-perf-dp.png" width="400" /></a></td></tr>
<tr><td class="tr-caption">% of peak DP Flops and memory bandwidth performance related with the operational intensity</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div style="text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Xs2z-pozslEp2gwOA7VXscxd01Itzx-PdokbTuyUTZb8_zcSwfyGVZW5mey-KYUY89_z0mGBbM5FmsM-0zOs5ucOZ9ZgXSoWkbvvOOKMWdr8KoCjMD6FKKtOIIaiceedzZh20H79EaQh/s1600/mixbench-balance-sp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Xs2z-pozslEp2gwOA7VXscxd01Itzx-PdokbTuyUTZb8_zcSwfyGVZW5mey-KYUY89_z0mGBbM5FmsM-0zOs5ucOZ9ZgXSoWkbvvOOKMWdr8KoCjMD6FKKtOIIaiceedzZh20H79EaQh/s400/mixbench-balance-sp.png" width="400" /></a></td></tr>
<tr><td class="tr-caption">Compute throughput (SP Flops) vs memory bandwidth</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjavmX8Xw1MAkmUHZ-Jq7RdGJBD7hI2lAjwoCetDVcaDdo3GTf1UH3Xgw3f10FdTE42JnunDum-1VM4HrfNmKoGhM6FgcSyf2Hw-wQdaXbEVMnqDINWnY38FOdZflZTDpFmH5p_PwnwIOEZ/s1600/mixbench-balance-dp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjavmX8Xw1MAkmUHZ-Jq7RdGJBD7hI2lAjwoCetDVcaDdo3GTf1UH3Xgw3f10FdTE42JnunDum-1VM4HrfNmKoGhM6FgcSyf2Hw-wQdaXbEVMnqDINWnY38FOdZflZTDpFmH5p_PwnwIOEZ/s400/mixbench-balance-dp.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Compute throughput (DP Flops) vs memory bandwidth</td></tr>
</tbody></table>
<br />
<h3>
Publication:</h3>
Since this work was initially part of published research <u>please cite the following publication</u> where applicable:<br />
<br />
Konstantinidis, E.; Cotronis, Y., "A Practical Performance Model for Compute and Memory Bound GPU Kernels," <i>Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on</i> , vol., no., pp.651,658, 4-6 March 2015<br />
doi: 10.1109/PDP.2015.51<br />
URL: <a href="http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092788&isnumber=7092002">http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092788&isnumber=7092002</a>Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-42814972428233020272015-06-15T19:01:00.001+03:002015-06-15T19:01:22.226+03:00IWOCL 2015 presentations available online<br />
<div style="text-align: justify;">
IWOCL 2015 (International Workshop on OpenCL) presentations are <a href="http://www.iwocl.org/conf-2015/iwocl-2015-presentation-downloads/" target="_blank">available online</a> for free download. It's a good thing that the organizers provide them not long after the conference takes place.</div>
<br /><br />
For more info about IWOCL: <a href="http://www.iwocl.org/" target="_blank">Link</a>Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-69242769173061155632015-06-04T19:54:00.000+03:002015-06-04T19:54:02.313+03:00Hot off the press!<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0HOaN7zNr01LEJA2plwDRV1ue93yi-vA9gphMTULritNaX9irFWMtiGxwlkP7O_qbxp9SS3HYTmxyi4omTqu8eRR-pcWJj7_w7QdGSOeOUKgwVbFY4XPKUDZ4jh65Y1sG8mzp9sK3g0ZH/s1600/hetcomp-ocl2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0HOaN7zNr01LEJA2plwDRV1ue93yi-vA9gphMTULritNaX9irFWMtiGxwlkP7O_qbxp9SS3HYTmxyi4omTqu8eRR-pcWJj7_w7QdGSOeOUKgwVbFY4XPKUDZ4jh65Y1sG8mzp9sK3g0ZH/s640/hetcomp-ocl2.jpg" width="620" /></a></div>
It has just arrived!<br />
<br />
<a href="https://store.elsevier.com/product.jsp?isbn=9780128014141" target="_blank">Link</a>Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-70610550937086381142015-04-19T22:52:00.001+03:002015-04-19T22:52:53.576+03:00About SPIR-V and OpenCL 2.1<div style="text-align: justify;">
Less than a couple of months ago the provisional release of OpenCL 2.1 and SPIR-V byte code was announced. SPIR-V is now defined completely from the ground-up and is not a patched LLVM derivative anymore. The real good news is that Khronos will push extended language features (C++) to be supported through an offline compiler.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPv_NLwxByfbn4dtT8h9h2e8gkW4BxkjJDOukB-_-bM7bo7L17Da7IKjaiegfeFUjgXA3pusetP4eSRLrBaph2YgIo-XcgMP0drXpGYOfVd8V8hc5Wspb4gtf7XP1PnFfRqYeGYgN4Mnvu/s1600/khornos-opencl-flowchart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPv_NLwxByfbn4dtT8h9h2e8gkW4BxkjJDOukB-_-bM7bo7L17Da7IKjaiegfeFUjgXA3pusetP4eSRLrBaph2YgIo-XcgMP0drXpGYOfVd8V8hc5Wspb4gtf7XP1PnFfRqYeGYgN4Mnvu/s1600/khornos-opencl-flowchart.png" height="235" width="400" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Let me explain. I believe that the source compiler should never have been a part of the device driver. Using an offline compiler and feeding the OpenCL runtime with kernels in a bytecode format would allow the runtime to be more lightweight, which is very important especially for mobile devices. It would also prove to be less error prone and more portable. My sense is that OpenCL consumption by the runtime will be obsolete in future OpenCL releases. I always found weird the way by supplying text source code to the library during runtime. This is a change for good and it will allow vendors releasing their implementations faster.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I also want to note that NVidia, after all these years of stagnation, silently released drivers supporting OpenCL 1.2 which allows us dreaming a future driver supporting SPIR-V. They do not support Fermi through (only Kepler & Maxwell).</div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-63539285294024663492015-03-17T23:18:00.000+02:002015-03-17T23:18:58.829+02:00ISA reference guide for Volcanic Islands architecture<div style="text-align: justify;">
A new GPU ISA manual is available for AMD GCN 3rd generation GPUs. This is probably regarding the Tonga GPU and Carrizo APU as it mentions that context switching is an additional capability to the architecture.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You may download it here:</div>
<br />
<a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf" target="_blank">http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf</a>Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-37080376597066458932015-02-28T00:02:00.000+02:002015-02-28T00:02:00.073+02:00Maxwell for the masses (GM206)<div class="separator" style="clear: both; text-align: justify;">
As you probably already know the mainstream version of Maxwell GPU has already been released in the form of GM206. The graphics card bearing the chip is the GTX-960. The card seems to be pretty efficient and a significant improvement over Kepler especially in compute applications which is the one aspect that I'm particularly interested in. There has been some controversy of course due to its short memory bus (128bit) which entails a peak memory bandwidth of 112GB/sec. However, the larger cache memory should help alleviating this bottleneck.</div>
<div class="separator" style="clear: both; text-align: justify;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivAua6v7kmUKFvAk0FFho9i_vCcCmNQsRLuVmtRnHqeMtIm2-IqyUzkv1VpEmAb5oiAXVwDqK6viDZkA20H5rDOLzOl9HaHpXKAPaP9f1kkKlAG0EfrW_Ckku4xwjYz6T33lrC81MetW0t/s1600/mygtx960.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivAua6v7kmUKFvAk0FFho9i_vCcCmNQsRLuVmtRnHqeMtIm2-IqyUzkv1VpEmAb5oiAXVwDqK6viDZkA20H5rDOLzOl9HaHpXKAPaP9f1kkKlAG0EfrW_Ckku4xwjYz6T33lrC81MetW0t/s1600/mygtx960.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Zotac GTX-960 AMP! edition</td></tr>
</tbody></table>
<br />
<div style="text-align: justify;">
In order to give you a taste about the compute capabilities of Maxwell I provide the results of experimenting with the OpenCL NBody example (16384 bodies) from the NVidia SDK 4.2 (the last one with OpenCL support). The GTX-960 yields a well above of 1TeraFlop performance which is impressive. I also performed executions with 3 more GPUs. All results are depicted in the chart that follows.</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2TQr-RjfuEtQvV4ShNMIybEkM9amTsSO7t5CS3RfVOoSZfSAjz0kD_3-hdEFS-AOrUz4_VV7YTWWLD_7iEqFPsZeKGTEwX0-JlMKqmjriPklY_lI3V0sC5fC3FZztcFrTS_F_LKi7z-vp/s1600/oclnbody-bench.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2TQr-RjfuEtQvV4ShNMIybEkM9amTsSO7t5CS3RfVOoSZfSAjz0kD_3-hdEFS-AOrUz4_VV7YTWWLD_7iEqFPsZeKGTEwX0-JlMKqmjriPklY_lI3V0sC5fC3FZztcFrTS_F_LKi7z-vp/s1600/oclnbody-bench.png" height="361" width="640" /></a></div>
<br />
The red bars represent measured performance in GFLOPs and the green ones the efficiency as the ratio measured/peak GFLOPs performance.<br />
<div style="text-align: justify;">
The Maxwell architecture seems to address many issues with compute efficiency of its predecessor. However, there are two drawbacks. First, the low memory bandwidth as mentioned above and second, the quite low compute performance in double precision operations which is set now at 1/32 ratio with regard to single precision operations.</div>
<div style="text-align: justify;">
One last observation is the quite good performance of the AMD GPU although the example application had been developed by NVidia and it's reasonable to think that it is optimized for its own GPUs. This could be one of the main reasons that they stopped supporting the OpenCL paradigm.</div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com1tag:blogger.com,1999:blog-1855371117255475765.post-16601025664321010162015-02-13T22:28:00.000+02:002015-02-13T22:28:09.933+02:00Raspberry Pi 2 is here!<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7yWjJKn3BxSTDBsgr-mm7hzM-PCn1dbsSMPmyZ-9c5yE2EPlsMVCOEzXOu3W45kGEcbuP3Q2Jl7EMcsxGPxJIVWs6HwiV2w5miVvLyOhcy5ByT06l_7Lh98NyuQoDlkBlxb7ws70Xgmj/s1600/mypi2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7yWjJKn3BxSTDBsgr-mm7hzM-PCn1dbsSMPmyZ-9c5yE2EPlsMVCOEzXOu3W45kGEcbuP3Q2Jl7EMcsxGPxJIVWs6HwiV2w5miVvLyOhcy5ByT06l_7Lh98NyuQoDlkBlxb7ws70Xgmj/s1600/mypi2.jpg" /></a></div>
<br />
<div style="text-align: justify;">
Well, it's here! Raspberry PI 2 looks very similar to it's predecessor, the Raspberry PI B+, except of two things. The rather old ARM11 core is upgraded to not one but four Cortex-A7 cores (900MHz). The Cortex-A7 is an upgrade by itself as benchmarks has shown that it is 1.5-3 times faster than the old CPU core. Four CPU cores do a decent upgrade for the same power envelope and the same price ($35). And this is not all of the changes. The new PI features double the amount of RAM which now reaches to 1GB.</div>
<div style="text-align: justify;">
To summarize it is a great upgrade of the old PI. I would say that it is the most affordable 4 core computer for applying parallel programming paradigms, e.g. OpenMP.</div>
One can compare these nbench output to the original <a href="http://parallelplusplus.blogspot.gr/search/label/nbench" target="_blank">Raspberry PI nbench results</a>. Keep in your mind that nbench is a single threaded benchmark.<br />
<br />
<br />
<br />
<pre class="prettyprint lang-html">BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 453.9 : 11.64 : 3.82
STRING SORT : 36.298 : 16.22 : 2.51
BITFIELD : 1.1028e+08 : 18.92 : 3.95
FP EMULATION : 82.381 : 39.53 : 9.12
FOURIER : 4877.8 : 5.55 : 3.12
ASSIGNMENT : 7.1713 : 27.29 : 7.08
IDEA : 1364.7 : 20.87 : 6.20
HUFFMAN : 663.8 : 18.41 : 5.88
NEURAL NET : 5.7769 : 9.28 : 3.90
LU DECOMPOSITION : 224.96 : 11.65 : 8.42
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 20.419
FLOATING-POINT INDEX: 8.434
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : 4 CPU ARMv7 Processor rev 5 (v7l)
L2 Cache :
OS : Linux 3.18.5-v7+
C compiler : gcc-4.7
libc : /lib/arm-linux-gnueabihf/libgcc_s.so.1
MEMORY INDEX : 4.125
INTEGER INDEX : 5.970
FLOATING-POINT INDEX: 4.678
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
</pre>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-67616464831347696432014-12-20T13:47:00.001+02:002014-12-20T13:47:45.575+02:00Workgroup reduction function evaluation. How well do they perform?<div dir="ltr" style="text-align: justify;">
The initial AMD driver for OpenCL 2.0 has already been <a href="http://support.amd.com/en-us/download" target="_blank">released</a>. The latest version of the OpenCL parallel programming API is quite interesting as it supports shared virtual memory, dynamic parallelism, pipes and other features. Among the rest of them are the workgroup and sub-workgroup functions which are abstractions that on one hand simplify parallel primitive operations such as broadcast, scan and reduction operations and provide the opportunity for the compiler for further optimizations on the other.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr" style="text-align: justify;">
In order to evaluate the workgroup function performance I developed a test case experiment for a reduction of the sum 1+2+3+...+N. Reduction is implemented in 3 different ways with 3 kernels. The first kernel is performed in the classical manner with shared memory. The last performs the reduction with the workgroup reduction function. The intermediate kernel uses shared memory for the inter-wavefront stages and the subgroup reduction operation for the intra-wavefront stage.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr" style="text-align: justify;">
The results seem somehow disappointing. The execution configuration is a 64bit Linux system, with an R7-260X GPU. The results are as follows:</div>
<div dir="ltr">
<br /></div>
<pre class="prettyprint lang-html">Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case
Platform/Device selection
Total platforms: 1
AMD Accelerated Parallel Processing
1. Bonaire/Advanced Micro Devices, Inc.
2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel
Select device index:
Device info
Platform: AMD Accelerated Parallel Processing
Device: Bonaire
Driver version: 1642.5 (VM)
OpenCL version: OpenCL 2.0 AMD-APP (1642.5)
Great! OpenCL 2.0 is supported :)
Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"
1. Shared memory only kernel
Executing...Done!
Output: 2147450880 / Time: 0.089481 msecs (0.732401 billion elements/second)
PASSED!
2. Hybrid kernel via subgroup functions
Executing...Done!
Output: 2147450880 / Time: 0.215851 msecs (0.303617 billion elements/second)
Relative speed-up to kernel 1: 0.41455
PASSED!
3. Workgroup function kernel
Executing...Done!
Output: 2147450880 / Time: 0.475408 msecs (0.137852 billion elements/second)
Relative speed-up to kernel 1: 0.188219
PASSED!
</pre>
<div dir="ltr">
<br /></div>
<div dir="ltr" style="text-align: justify;">
The kernel with the workgroup function seems to perform more than 5 times slower than using just shared memory. This should definitely not be the case in a performance oriented environment like OpenCL. The performance of workgroup functions should be at least the same as using shared memory. Otherwise the workgroup functions are not essentially useful.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr" style="text-align: justify;">
Unfortunately, CodeXL version 1.6 does not support static analysing of OpenCL 2.0 kernels and therefore I cannot inspect the resulting assembly code produced for the workgroup functions. According to theory swizzle operations has to be leveraged in order to optimize such operations.</div>
<div dir="ltr" style="text-align: justify;">
<br /></div>
<div dir="ltr">
Test case download link on github:</div>
<div dir="ltr">
<u><u><a href="https://github.com/ekondis/cl2-reduce-bench" target="_blank">https://github.com/ekondis/cl2-reduce-bench</a></u></u><br />
<br />
In case you notice any different results please let me know.<br />
<u><u><br /></u></u></div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com6tag:blogger.com,1999:blog-1855371117255475765.post-51660595186216837742014-12-09T21:00:00.001+02:002014-12-09T21:00:49.666+02:00AMD OpenCL 2.0 SDK is available (BETA)<div class="separator" style="clear: both; text-align: center;">
<a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2014/12/appsdk30opencl20img-300x167.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2014/12/appsdk30opencl20img-300x167.jpg" height="110" width="200" /></a></div>
<div style="text-align: justify;">
Eventually, the AMD SDK for OpenCL 2.0 has been released in a beta form. There are many examples exhibiting the new features. There are new accompanying documentation files though they are not written from scratch. For instance, table 2.5 in the <a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide.pdf" target="_blank">optimization guide</a> refers to only HD 7xxx devices. It wouldn't be hard for AMD to add the respective tables for Rx 2xx devices. Overall, this is a significant step forward for the OpenCL 2.0 adoption.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The device driver supporting OpenCL 2.0 was also released today.</div>
<div style="text-align: justify;">
<br /></div>
<div>
For more information and download:</div>
<div>
<a href="http://developer.amd.com/community/blog/2014/12/09/amd-app-sdk-3-0-beta/" target="_blank">http://developer.amd.com/community/blog/2014/12/09/amd-app-sdk-3-0-beta/</a></div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-84570786617768561552014-10-05T13:05:00.002+03:002014-10-06T20:06:01.123+03:00Least required GPU parallelism for kernel executions<div style="text-align: justify;">
GPUs require a vast number of threads per kernel invocation in order to utilize all execution units. As a first thought one should spawn at least the same number of threads as the number of available shader units (or CUDA cores or Processor Elements). However, this is not enough. The type of scheduling should be taken into account. Scheduling in Compute Units is done by multiple schedulers which in effect restricts the group of shader units in which a thread can execute. For instance the Fermi SMs consist of 32 shader units but at least 64 threads are required because 2 schedulers are evident in which the first can schedule threads only on the first group of 16 shader units and the other on the rest group. Thus a greater number of threads is required. What about the rest GPUs? What is the minimum threading required in order to enable all shader units? The answer lies on schedulers of compute units for each GPU architecture.</div>
<br />
<h3>
NVidia Fermi GPUs</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.geeks3d.com/public/jegx/201001/fermi_gt100_sm.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.geeks3d.com/public/jegx/201001/fermi_gt100_sm.jpg" height="400" width="152" /></a></div>
<br />
<div style="text-align: justify;">
Each SM (Compute Unit) consists of 2 schedulers. Each scheduler handles 32 threads (WARP size), thus 2x32=64 threads are the minimum required per SM. For instance a GTX480 with 15 CUs requires at least 960 active threads.</div>
<br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<br />
<span style="font-size: large;"><br /></span>
<br />
<h3>
</h3>
<h3>
NVidia Kepler GPUs</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://regmedia.co.uk/2012/05/17/nvidia_kepler2_smx_block_diagram.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://regmedia.co.uk/2012/05/17/nvidia_kepler2_smx_block_diagram.jpg" height="400" width="355" /></a></div>
<div style="text-align: justify;">
Each SM (Compute Unit) consists of 4 schedulers. Each scheduler handles 32 threads (WARP size), thus 4x32=128 threads are the minimum requirement per SM. A GTX660 with 8 CUs requires at least 1024 active threads.</div>
<br />
<div style="text-align: justify;">
In addition, more independent instructions are required in the instruction stream (instruction level parallelism) in order to utilize the extra 64 shaders of each CU (192 in total).</div>
<br />
<br />
<br />
<h3>
</h3>
<h3>
</h3>
<h3>
</h3>
<h3>
</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
</h3>
<h3>
</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
</h3>
<h3>
NVidia Maxwell GPUs</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.extremetech.com/wp-content/uploads/2014/02/Maxwell-SMM.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.extremetech.com/wp-content/uploads/2014/02/Maxwell-SMM.png" height="400" width="212" /></a></div>
<div style="text-align: justify;">
Same as Kepler. A GTX660 with 8 CUs requires at least 1024 active threads. A GTX980 with 16 CUs requires 2048 active threads.</div>
<br />
<div style="text-align: justify;">
The requirement for instruction independency does not apply here (only 128 threads per CU).</div>
<br />
<br />
<br />
<h3>
</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
</h3>
<h3>
</h3>
<h3>
</h3>
<h3>
</h3>
<h3>
</h3>
<div>
<br /></div>
<h3>
</h3>
<h3>
AMD GCN GPUs</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.guru3d.com/miraserver/images/news/2012/GCN_CU.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.guru3d.com/miraserver/images/news/2012/GCN_CU.jpg" height="187" width="400" /></a></div>
<div style="text-align: justify;">
Regarding the AMD GCN units the requirement is more evident. This is because each scheduler handles threads in four groups, one for each SIMD unit. This is like having 4 schedulers per CU. Furthermore the unit of thread execution is done per 64 threads instead of 32. Therefore each CU requires the least of 4x64=256 threads. For instance a R9-280X with 32 CUs require a vast amount of 8192 threads! This fact justifies the reason for which in many research papers the AMD GPUs fail to stand against NVidia GPUs for small problem sizes where the amount of active threads is not enough.</div>
<br />
<br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-42152478591513993142014-05-23T12:57:00.001+03:002014-05-23T12:57:05.813+03:00IWOCL 2014 (International Workshop on OpenCL) presentations available onlinePresentation files of the the IWOCL (International Workshop on OpenCL) 2014 are available for download.<br /><br />URL:<br /><a href="http://iwocl.org/iwocl-2014/agenda-and-slides/">http://iwocl.org/iwocl-2014/agenda-and-slides/</a><br /><br /><br /><i>Note: The International Workshop on OpenCL (IWOCL) is an annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practise, and to promote the evolution and advancement of the OpenCL standard. The meeting is open to anyone who is interested in contributing to, and participating in the OpenCL community.</i><div>
<span style="background-color: white; color: #141412; font-family: 'Source Sans Pro', Helvetica, sans-serif; font-size: 16px; line-height: 24px;"><br /></span></div>
Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-17142202217204491502014-04-16T15:38:00.000+03:002014-04-17T08:19:37.373+03:00Loop execution performance comparison in various programming languages<div style="text-align: justify;">
The main focus of a GPU programmer is performance. Therefore the execution time of various time consuming loops is of significant consideration. In this regard I performed some experiments in various programming languages of a small nested loop. The problem investigated is a trivial one though it needs significant number of operations to be performed in a nested loop form.<br />
<br />
<h3>
Problem definition</h3>
<div>
<br /></div>
Search for a pair of integers in the [1..15000] range whose multiple is equal to 87654321.<br />
<br />
<h3>
Loop implementations</h3>
<br />
A trivial solution of this problem is provided in the following python code:<br />
<pre class="prettyprint lang-python">for i in range(1, 15001):
for j in range(i+1, 15001):
if i*j==87654321:
print "Found! ",str(i)," ",str(j)
break
</pre>
<br />
Converting the code above to C is straightforward. The code can be easily parallelized using OpenMP constructs by adding a single line:<br />
<br />
<pre class="prettyprint lang-c">#pragma omp parallel for private(j) schedule(dynamic,500)
for(i=1; i<=15000; i++)
for(j=i+1; j<=15000; j++)
if( i*j==87654321 )
printf("Found! %d %d\n", i, j);
</pre>
<br />
The schedule parameter directs the compiler to apply dynamic scheduling in order to address the unbalanced nature of the iterations (first outer loop performs 14999 operations while the last one does none).<br />
<br />
A naive implementation in OpenCL is also provided. A workitem is assigned to each iteration of the outer loop:<br />
<br />
<pre class="prettyprint lang-opencl">__kernel void factor8to1(unsigned int limit, global int *results){
int i = get_global_id(0);
if( i<=limit )
for(int j=i+1; j<=limit; j++)
if( i*j==87654321 ){
results[0] = i;
results[1] = j;
}
}
</pre>
<br />
The OpenCL kernel requires to be launched with an NDRange of 15000 workitems. These are not adequate especially for large GPUs but they should be enough for a demo.<br />
<br />
Of course this kernel is not well balanced neither optimized, in order to be clear to read and understand. Note that the goal of this project is not to provide an optimized factorization algorithm but to demonstrate the loop code efficiency in various scripting and compiled languages, as well as, to provide a glimpse to the gains of parallel processing.</div>
<br />
Code is written in the following languages:<br />
<ol>
<li> Python</li>
<li> JavaScript</li>
<li> Free pascal compiler</li>
<li> C</li>
<li> OpenMP/C</li>
<li> OpenCL</li>
</ol>
<br />
<div>
All sources are provided on github: <a href="https://github.com/ekondis/factor87654321" target="_blank">https://github.com/ekondis/factor87654321</a><br />
<br /></div>
<h3>
Execution results on A6-1450 APU</h3>
<div>
<br /></div>
<div style="text-align: justify;">
Here are provided the execution results of executions on an AMD A6-1450 APU which is a low power processing unit which combines a CPU and a GPU on the same die package. It features a quad core CPU (Jaguar cores) running at 1GHz and a GCN-GPU with 2 compute units (128 processing elements in total).</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcMtrWrGfCkNvjh22tJOEizTGM6atPcI3CqCMOOC986w2CYbeIvVJ5minyj4qrYy39kvtFHnl-pfy645t8dGdQ0a2rZMVmY_bEv7gyFyITssSg3_Z9IFD0i9kzCa_H0Rphy0M_C0JDpV_h/s1600/resultss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcMtrWrGfCkNvjh22tJOEizTGM6atPcI3CqCMOOC986w2CYbeIvVJ5minyj4qrYy39kvtFHnl-pfy645t8dGdQ0a2rZMVmY_bEv7gyFyITssSg3_Z9IFD0i9kzCa_H0Rphy0M_C0JDpV_h/s1600/resultss.png" height="520" width="640" /></a></div>
<br />
The benefits of parallel processing are apparent. The advancements of javascript JIT engines are also evident.Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-70318185550320693972014-02-18T20:05:00.001+02:002014-02-18T20:05:19.475+02:00Maxwell further lowers double precision performance for GeForce GPUs<div style="text-align: justify;">
Now this double precision mockery seems to have no end. For top end Fermi based GPUs the ratio was 1/8 which was just acceptable. For the rest Fermi GPUs the ratio became 1/12. Thereafter, Kepler further reduced it to 1/24. And today we learn that <b>the first Maxwell GPUs further cut it to 1/32</b>!</div>
<br />
<div style="text-align: justify;">
As long as NVidia wants to sell as more Teslas as it gets we will never be able to achieve acceptable performance in double precision arithmetic from consumer cards. Actually, using a consumer GPU (excluding GTX Titan) for a compute intensive problem does not worth considering the CPU improvements with 256bit AVX2 plus the addition of FMA instructions. And certainly not everyones has 1000$ to waste for a GTX Titan. I would expect a decent double precision performance from a mid-range card of, lets say 300$, but unfortunately that's not the case.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I hope the next architecture dubbed Volta will not emply a 1/128 ratio though it doesn't actually make much difference if it is 1/32, 1/64 or 1/128. These ratios turn double precision compute on consumer cards meaningless.</div>
<br />
Source: <a href="http://www.tomshardware.com/reviews/geforce-gtx-750-ti-review,3750.html#xtor=RSS-182" target="_blank">http://www.tomshardware.com/reviews/geforce-gtx-750-ti-review,3750.html#xtor=RSS-182</a><br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-4933615119863697352014-02-15T12:14:00.001+02:002014-02-15T12:14:21.272+02:00AMD Catalyst 14.1 and OpenCL SPIR<div style="text-align: justify;">
I recently noticed that the AMD Catalyst 14.1 BETA seemed to enable a very interesting extension. Look the extract of the clinfo command bellow executed on a HD-7750:</div>
<br />
<br />
<pre class="prettyprint lang-html">Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 AMD-APP (1411.4)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_amd_hsa
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Board name: AMD Radeon HD 7700 Series
Device Topology: PCI[ B#5, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 820Mhz
Address bits: 32
Max memory allocation: 685349273
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 802160640
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0xb7446660
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1411.4 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1411.4)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir
</pre>
<br />
<div style="text-align: justify;">
Just look at look at the last line of the supported extensions of the device. There is a magic word called <b>cl_khr_spir</b>! Does this mean that SPIR is already supported by the driver? I don't know and I haven't performed any tests yet. Unfortunately I don't have much time to do it now but if anyone does please let me know.</div>
<br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0tag:blogger.com,1999:blog-1855371117255475765.post-33060602243680862562014-01-28T19:45:00.000+02:002014-01-28T19:45:36.976+02:00Benchmarking the capabilities of your OpenCL device with clpeak, etc.In case you're interested in benchmarking the performance of your GPU/CPU with OpenCL you could try a simple program named <i>clpeak</i>. It's hosted on github: <a href="https://github.com/krrishnarraj/clpeak" target="_blank">https://github.com/krrishnarraj/clpeak</a><br />
<br />
For instance here is the output on the A4-1450 APU.<br />
<pre class="prettyprint lang-html">Platform: AMD Accelerated Parallel Processing
Device: Kalindi
Driver version : 1214.3 (VM) (Linux x64)
Compute units : 2
Global memory bandwidth (GBPS)
float : 6.60
float2 : 6.71
float4 : 6.45
float8 : 3.51
float16 : 1.83
Single-precision compute (GFLOPS)
float : 100.63
float2 : 101.26
float4 : 100.94
float8 : 100.32
float16 : 99.08
Double-precision compute (GFLOPS)
double : 6.35
double2 : 6.37
double4 : 6.36
double8 : 6.34
double16 : 6.32
Integer compute (GIOPS)
int : 20.33
int2 : 20.39
int4 : 20.36
int8 : 20.33
int16 : 20.32
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.80
enqueueReadBuffer : 1.98
enqueueMapBuffer(for read) : 84.42
memcpy from mapped ptr : 1.81
enqueueUnmap(after write) : 54.32
memcpy to mapped ptr : 1.87
Kernel launch latency : 138.08 us
Device: AMD A6-1450 APU with Radeon(TM) HD Graphics
Driver version : 1214.3 (sse2,avx) (Linux x64)
Compute units : 4
Global memory bandwidth (GBPS)
float : 1.97
float2 : 2.51
float4 : 1.95
float8 : 2.79
float16 : 3.54
Single-precision compute (GFLOPS)
float : 1.30
float2 : 2.50
float4 : 5.01
float8 : 9.21
float16 : 1.07
Double-precision compute (GFLOPS)
double : 0.62
double2 : 1.35
double4 : 2.56
double8 : 6.27
double16 : 2.44
Integer compute (GIOPS)
int : 1.60
int2 : 1.22
int4 : 4.70
int8 : 8.08
int16 : 7.91
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 2.67
enqueueReadBuffer : 2.03
enqueueMapBuffer(for read) : 13489.22
memcpy from mapped ptr : 2.02
enqueueUnmap(after write) : 26446.84
memcpy to mapped ptr : 2.03
Kernel launch latency : 32.74 us
</pre>
<br />
<br />
P.S.<br />
1) Some performance measures of the recently released Kaveri APU are provided on Anandtech:<br />
<a href="http://www.anandtech.com/show/7711/floating-point-peak-performance-of-kaveri-and-other-recent-amd-and-intel-chips" target="_blank">http://www.anandtech.com/show/7711/floating-point-peak-performance-of-kaveri-and-other-recent-amd-and-intel-chips</a><br />
2) If you are interested you can find the presentation of the Kaveri on Tech-Day in PDF format here:<br />
<a href="http://www.pcmhz.com/media/2014/01-ianuarie/14/amd/AMD-Tech-Day-Kaveri.pdf" target="_blank">http://www.pcmhz.com/media/2014/01-ianuarie/14/amd/AMD-Tech-Day-Kaveri.pdf</a><br />
3) The Alpha 2 of Ubuntu 14.04 seems to resolve the shutdown problem of the Temash laptop (Acer Aspire v5 122p). It must be due to the 3.13 kernel update. So, I'm looking forward to the final Ubuntu 14.04 release.<br />
<br />Eliashttp://www.blogger.com/profile/13741870397057347480noreply@blogger.com0