<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Science Letter]]></title><description><![CDATA[Actionable insights on real-world applications of AI, Machine Learning, and Data Science. Discover how leading industries are using data to innovate!]]></description><link>https://newsletter.datascienceletter.com</link><image><url>https://substackcdn.com/image/fetch/$s_!XHLE!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a75787e-c78d-4a27-a5dc-e2bf1202b04f_500x500.png</url><title>Data Science Letter</title><link>https://newsletter.datascienceletter.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 19:38:32 GMT</lastBuildDate><atom:link href="https://newsletter.datascienceletter.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Data Science Letter]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datascienceletter@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datascienceletter@substack.com]]></itunes:email><itunes:name><![CDATA[Data Science Letter]]></itunes:name></itunes:owner><itunes:author><![CDATA[Data Science Letter]]></itunes:author><googleplay:owner><![CDATA[datascienceletter@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datascienceletter@substack.com]]></googleplay:email><googleplay:author><![CDATA[Data Science Letter]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Gemma 3 Is Here — Google's Open-Source LLMs Just Got a Big Upgrade]]></title><description><![CDATA[A complete comparison of Gemma 1, 2, and 3 &#8212; model sizes, architecture changes, 
multimodal support, context length upgrades, and what makes Gemma 3 a powerful open-source model]]></description><link>https://newsletter.datascienceletter.com/p/gemma-3-is-here-googles-open-source</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/gemma-3-is-here-googles-open-source</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Wed, 02 Apr 2025 21:31:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92dd7251-1cb2-4334-88b9-c5a78ae0213b_7952x5304.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier in March, Google released <strong>Gemma 3</strong>, the newest release in its open-source LLM series &#8212; and it's a significant leap over its predecessors!</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iWh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iWh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 1272w, 
https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png" width="1456" height="243" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iWh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 848w, 
https://substackcdn.com/image/fetch/$s_!iWh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd393f90a-e714-4455-9e78-070d6e9ddf94_1800x300.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>In this article, we will cover:</p><ul><li><p>Major upgrades in Gemma 3 over its predecessors</p></li><li><p>Technical details of Gemma 3</p></li><li><p>Knowledge recap &#8212; knowledge distillation</p></li><li><p>Knowledge recap &#8212; quantisation</p></li><li><p>Benchmark performance of Gemma 3</p></li><li><p>How to use images as inputs with Gemma 3</p></li><li><p>Gemma 1 and 2 recap</p></li><li><p>Final thoughts</p></li></ul><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vC3U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vC3U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!vC3U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 848w, 
https://substackcdn.com/image/fetch/$s_!vC3U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!vC3U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vC3U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png" width="1456" height="243" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142634,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vC3U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 424w, 
https://substackcdn.com/image/fetch/$s_!vC3U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!vC3U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!vC3U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34e858b7-7b36-492b-9f98-710ed9e3d96a_1800x300.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><ul><li><p><em><strong>Strong performance with practical model sizes</strong></em></p><ul><li><p>Gemma 3 is available in 1B, 4B, 12B, and 27B parameter sizes.</p></li><li><p>Despite its small size, it outperforms many larger models (the 27B model is currently ranked 12th, <strong>outperforming Llama3-405B, DeepSeek-V3 and o3-mini</strong>!).</p></li><li><p>Gemma 3 can fit on a single GPU for on-device inference.</p></li></ul></li><li><p><em><strong>One of the longer context windows among open models</strong></em></p><ul><li><p>The 4B, 12B, and 27B models support a 128K-token context window, while the 1B model supports 32K.</p></li></ul></li><li><p><em><strong>Multi-lingual</strong></em></p><ul><li><p>Supports 140+ languages!</p></li><li><p>This is a major improvement over Gemma 2, which only supports English text.</p></li></ul></li><li><p><em><strong>Multi-modal</strong></em></p><ul><li><p>Inputs to Gemma 3 can now include images! 
This enables Gemma 3 to be used for tasks such as visual Q&amp;A, image captioning, and document analysis that includes images.</p></li></ul></li></ul><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i4jy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i4jy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i4jy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png" width="1456" height="243" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i4jy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!i4jy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1f112a-952e-4860-8801-6aa11c86bdaf_1800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I have also compiled a detailed breakdown of Gemma 3 for you. 
Enjoy!</p><h4>Model Sizes</h4><ul><li><p><strong>1B, 4B, 12B, 27B</strong> &#8212; all trained via <strong>knowledge distillation</strong></p></li><li><p>However, the paper does not specify which teacher models were used for knowledge distillation.</p></li></ul><h4>Context Window</h4><ul><li><p>Extended to <strong>32K tokens</strong> for the 1B model</p></li><li><p>Up to <strong>128K tokens</strong> for the 4B, 12B and 27B models</p></li></ul><h4>Multimodal + Multilingual</h4><ul><li><p><strong>Multimodal input</strong> via a SigLIP vision encoder</p></li><li><p><strong>Multilingual support</strong>, a big shift from Gemma 1 &amp; 2</p></li></ul><h4>Architectural Highlights</h4><ul><li><p>Interleaved local/global attention layers at a 5:1 ratio, which reduces KV-cache memory at long context lengths</p></li><li><p>Grouped-Query Attention</p></li><li><p>Pan &amp; Scan &#8212; an inference-time windowing technique that splits an image into crops, reducing the artifacts caused by resizing non-square or high-resolution images to the encoder's fixed input size</p></li></ul><h4>Tokenisation</h4><ul><li><p>Still based on SentencePiece, but now with:</p><ul><li><p>Split digits (e.g., "123" &#8594; "1", "2", "3")</p></li><li><p>Preserved whitespace</p></li><li><p>Byte-level fallback encoding for rare characters</p></li></ul></li></ul><h4>Quantisation-aware training</h4><ul><li><p>Built-in QAT support for better low-precision performance</p></li></ul><h4>Further resources</h4><p><a href="https://arxiv.org/abs/2503.19786">Gemma 3 technical report</a></p><p><a href="https://ai.google.dev/gemma/docs/core">Gemma documentation</a></p><p><a href="https://arxiv.org/abs/2303.15343">Sigmoid Loss for Language Image Pre-Training</a></p><p><a href="https://arxiv.org/abs/2502.14786">SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</a></p><p><a href="https://arxiv.org/abs/2305.13245">GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints</a></p><p><a href="https://arxiv.org/abs/1808.06226">SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text 
Processing</a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hX7Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png" width="1456" height="243" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71596df1-4414-49ba-a781-866a34325929_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143782,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>What is quantisation?</h3><h4>Common format: FP64, FP32, FP16, BF16, INT8</h4><p>There are a few basic data types to know to understand quantisation in LLMs. 
In general, model parameters are stored in a specific numerical format. The most common types are:</p><ul><li><p>Floating point &#8212; FP64 (64-bit), FP32 (32-bit), FP16 (16-bit)</p></li><li><p>BFloat 16-bit (BF16)</p></li><li><p>Integer 8-bit (INT8)</p></li></ul><h4>Floating point</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r4gP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r4gP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 424w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 848w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 1272w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r4gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png" width="1456" height="186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:186,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r4gP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 424w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 848w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 1272w, https://substackcdn.com/image/fetch/$s_!r4gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe71df39-bd9c-4303-84a4-3d891de4ac49_1920x245.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">FP32 representation (source: Wikipedia)</figcaption></figure></div><p>The &#8220;floating point&#8221; format is one of the most common ways to represent numerical values in 
modern computers. The general structure is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathrm{value} = (-1)^s \\times 1.m \\times 2^{e-bias}&quot;,&quot;id&quot;:&quot;LPJZBNBCSJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><ul><li><p>s = sign bit (0 for positive, 1 for negative)</p></li><li><p>m = mantissa (the fractional part of the significand)</p></li><li><p>e = exponent (stored as an unsigned integer)</p></li><li><p>bias = a fixed offset subtracted from e so that both positive and negative exponents can be represented (127 for FP32)</p></li></ul><p>Here is a breakdown for each bit:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/LlVRP/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bba9c9bc-bedb-4553-b20d-91b63dd6ed54_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:360,&quot;title&quot;:&quot;Floating point breakdown&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/LlVRP/1/" width="730" height="360" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The ranges supported by the 64-bit, 32-bit and 16-bit formats differ massively:</p><ul><li><p>64-bit: ~ &#177; 10^308</p></li><li><p>32-bit: ~ &#177; 10^38</p></li><li><p>16-bit (FP16): ~ &#177; 10^4</p></li></ul><h4>BF16</h4><p>If we decrease the number of bits used in a floating point format, the representable numerical range generally shrinks. 
To keep approximately the same range for model training or inference, one can use a format such as bfloat16.</p><p><strong>bfloat16 (BF16)</strong> is a <strong>16-bit floating point format</strong> designed to provide <strong>speed and memory efficiency</strong> like FP16, while preserving much of the <strong>range and numerical stability</strong> of FP32. It does this by keeping the 8 exponent bits of FP32 and shortening the mantissa to 7 bits.</p><p>It was introduced by Google and is widely used in <strong>deep learning hardware</strong> (like TPUs and newer CPUs/GPUs) because it strikes a good balance between <strong>performance</strong> and <strong>training stability</strong>.</p><h4>Int8</h4><p><strong>Int8</strong> (8-bit integer) is a compact numerical format that uses just 8 bits to represent whole numbers, typically ranging from -128 to 127 (signed) or 0 to 255 (unsigned). In practice, <strong>Int8 is used to represent floating point values approximately</strong>, using an affine transformation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{\\mathrm{float}} \\approx (x_{\\mathrm{int8}} - \\mathrm{zeropoint}) \\times \\mathrm{scale}&quot;,&quot;id&quot;:&quot;JKFEXOJZPY&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>Scale</strong>: the resolution (the float step size between two adjacent integer values)</p></li><li><p><strong>Zero-point</strong>: the integer value that represents the float value 0.0</p></li></ul><h4>Post-training quantisation</h4><p><strong>Post-training quantisation (PTQ)</strong> is a technique used to convert a trained neural network (typically in 32-bit floating point precision, FP32) into a smaller and faster version by reducing the numerical precision of weights and activations &#8212; <strong>after</strong> training is completed.</p><ul><li><p>Step 1: model training</p><ul><li><p>You train a model normally in FP32 for accuracy and stability.</p></li></ul></li><li><p>Step 2: quantisation</p><ul><li><p>You convert:</p><ul><li><p><strong>Weights</strong> &#8594; to INT8 (or 
FP16)</p></li><li><p><strong>Activations</strong> &#8594; either dynamically during inference, or using calibration data</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{\\mathrm{int8}} = \\mathrm{round}(\\frac{x_{\\mathrm{float}}}{\\mathrm{scale}}) + \\mathrm{zeropoint}&quot;,&quot;id&quot;:&quot;QGIVQPSPCS&quot;}" data-component-name="LatexBlockToDOM"></div></li></ul></li><li><p>Step 3: calibration [optional]</p><ul><li><p>You run a <strong>small sample of input data</strong> through the model to estimate ranges (min/max) for activations.</p></li><li><p>This improves accuracy by choosing better quantisation parameters.</p></li></ul></li></ul><h4>Quantisation-aware training</h4><p>Quantisation-aware training (QAT) is a technique where a neural network is trained with quantisation effects simulated in the forward pass, so that it stays accurate once it is converted to low precision.</p><h4>&#9881;&#65039; How it works</h4><ol><li><p>&#128257; Simulate quantisation during training</p><ul><li><p>During forward passes, the model treats weights and activations as if they were low precision (e.g., INT8), but still stores and updates them in high precision (FP32).</p></li><li><p>This is done by inserting "fake quantisation" modules that simulate rounding and clamping to integer ranges.</p></li></ul></li><li><p>&#129520; Fine-tune with quantisation noise</p><ul><li><p>The model learns to adapt to the noise introduced by quantisation.</p></li><li><p>Back-propagation and weight updates still happen in full precision, but gradients reflect the quantised behaviour (typically via a straight-through estimator, since rounding itself has zero gradient almost everywhere).</p></li></ul></li><li><p>&#129504; Export fully quantised model</p><ul><li><p>After training, you convert weights and activations to true lower precision (e.g., INT8).</p></li><li><p>The model is now fully quantised and optimised for deployment.</p></li></ul></li></ol><h4>Further resources</h4><p><a 
href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization">A Visual Guide to Quantisation</a></p><p><a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE754 wikipedia</a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hX7Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png" width="1456" height="243" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71596df1-4414-49ba-a781-866a34325929_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hX7Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!hX7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71596df1-4414-49ba-a781-866a34325929_1800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>What is knowledge distillation?</h3><p><strong>Knowledge distillation</strong> is a technique where a smaller, simpler model (called the <em>student</em>) is trained to mimic the output of a larger, more complex model (called the <em>teacher</em>).</p><p>In the LLM landscape, there are two common ways to perform knowledge distillation:</p><ul><li><p>Train the student 
model on the teacher model's token probability distributions, used as <strong>soft labels</strong></p></li><li><p>Train the student model on a text dataset generated by prompting the teacher model</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vnl2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vnl2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vnl2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg" width="1456" height="495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:495,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231718,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vnl2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vnl2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d0bf7f-7bb2-4b43-ba59-b8859aaeb8ca_3429x1166.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3STP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3STP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!3STP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3STP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3STP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3STP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg" width="1456" height="518" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239808,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!3STP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3STP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3STP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3STP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa854e45-f525-477a-b24d-020b0da6b099_3351x1193.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Further resources</h4><ul><li><p><a href="https://arxiv.org/abs/1503.02531">Distilling the Knowledge in a Neural Network</a> G. Hinton et al. (2015)</p></li><li><p><a href="https://arxiv.org/abs/1910.01108">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</a></p><p>Victor Sanh et al. (2019)</p></li><li><p><a href="https://arxiv.org/abs/2402.13116">A Survey on Knowledge Distillation of Large Language Models</a></p><p>X. Xu et al. (2024)</p><div><hr></div></li></ul><h3>Benchmark performance</h3><p>For those who are not familiar with the common LLM benchmarks, I compiled a quick summary for you below. Enjoy!</p><h4>Chatbot Arena</h4><ul><li><p>One of the most referenced LLM leaderboards</p></li><li><p>Human preferences on AI-generated outputs</p></li><li><p>Evaluate and compare LLMs based on human preferences. 
Users can rank two AI-generated responses without knowing which models produced them</p></li><li><p>Developed by researchers at UC Berkeley</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B ranks 12th with an Elo score of 1339, outperforming larger open models like DeepSeek-V3, LLaMA 3.1 405B, and Qwen2.5-72B!</p></blockquote><h4>Further resources</h4><p><a href="https://arxiv.org/abs/2403.04132">Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference</a></p><p><a href="https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard">Leaderboard</a></p><div><hr></div><h4>MMLU-Pro</h4><ul><li><p>Language comprehension and reasoning</p></li><li><p>Topics include Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Other</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B achieves 0.676, outperforming Llama 3.3 70B!</p></blockquote><p>This benchmark is an enhanced version of the original Massive Multitask Language Understanding (MMLU) benchmark, designed to evaluate the capabilities of large language models (LLMs) in language comprehension and reasoning more rigorously across diverse domains.</p><p>Here is an example question from this benchmark:</p><pre><code><strong>Question:</strong> Which of the following cases established the precedent that a defendant must be informed of the right to remain silent, the right to a lawyer, and protection from self-incrimination?

Options:
A) Brown v. Board of Education
B) Miranda v. Arizona
C) Roe v. Wade
D) Betts v. Brady
E) Plessy v. Ferguson
F) Dred Scott v. Sandford
G) Weeks v. United States
H) Gideon v. Wainwright
I) Marbury v. Madison
J) Mapp v. Ohio

Answer: B) Miranda v. Arizona

Explanation: In the landmark case Miranda v. Arizona (1966), the U.S. Supreme Court ruled that individuals taken into police custody must be informed of their rights to remain silent and to have an attorney present during questioning. This decision established the "Miranda rights," ensuring protection against self-incrimination under the Fifth Amendment.</code></pre><h4>Further resources</h4><p><a href="https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro">MMLU-Pro leaderboard</a></p><div><hr></div><h4>LiveCodeBench</h4><ul><li><p>This benchmark assesses code generation capabilities on real-world coding problems from platforms like LeetCode and Codeforces.</p></li><li><p>It covers four major areas: code generation, self-repair, test-output prediction, and code execution.</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B achieves 29.7, while the score for Gemini-Flash-2.0-Exp is 31.8!</p></blockquote><h4>Further resources</h4><p><a href="https://livecodebench.github.io/leaderboard.html">Leaderboard</a></p><div><hr></div><h4>Bird-SQL</h4><ul><li><p>Tests a model's ability to translate natural language questions into complex SQL queries across various domains.</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B achieves 54.4, while the score for Gemini-1.5 is also 54.4!</p></blockquote><h4>Further resources</h4><p><a href="https://bird-bench.github.io/">Leaderboard</a></p><div><hr></div><h4>GPQA Diamond</h4><ul><li><p>GPQA is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts at a PhD level of difficulty. The Diamond subset keeps the 198 highest-quality questions.</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B achieves 42.4, while the score for GPT-4o (0513) is 53.6</p></blockquote><h4>Further resources</h4><p><a href="https://arxiv.org/abs/2311.12022">GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark</a></p><p><a 
href="https://klu.ai/glossary/gpqa-eval">Leaderboard</a></p><div><hr></div><h4>MATH</h4><ul><li><p>Problem-solving, reasoning, mathematics</p></li><li><p>A benchmark consisting of over 12,000 competition-level high school mathematics problems.</p></li></ul><blockquote><p><strong>Performance</strong></p><p>Gemma 3 27B achieves 89.0, while the score for Gemini 2.0 is 91.8</p></blockquote><h4>Further resources</h4><p><a href="https://github.com/hendrycks/math">GitHub</a></p><div><hr></div><h4>SimpleQA &amp; FACTS Grounding</h4><ul><li><p>Measures how reliably LLMs produce factual outputs, since LLMs sometimes hallucinate</p></li></ul><blockquote><p><strong>Performance</strong></p><p>SimpleQA &#8212; Gemma 3 27B achieves only 10.0, while the score for Gemini 2.0 is 44.3!</p><p>FACTS Grounding &#8212; Gemma 3 27B achieves only 74.9, while the score for Gemini 2.0 is 82.8!</p></blockquote><p>There is a significant performance gap between Gemma 3 and closed-source models such as Gemini 2.0 on SimpleQA!</p><h4>Further resources</h4><p><a href="https://openai.com/index/introducing-simpleqa/">SimpleQA</a></p><p><a href="https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding</a></p><div><hr></div><h3>How to use images as input with Gemma 3?</h3><p>To use <strong>images as input with Gemma 3</strong>, you can use the Hugging Face <code>pipeline</code> API with the <code>"image-text-to-text"</code> task. This allows you to pass a combination of images and text to multimodal variants of Gemma 3, such as <code>gemma-3-4b-it</code>, <code>gemma-3-12b-it</code>, or <code>gemma-3-27b-it</code>.</p><p>Here&#8217;s an example using a hosted image and a natural language prompt:</p><pre><code>import torch
from transformers import pipeline

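# Load a multimodal Gemma 3 checkpoint behind the "image-text-to-text" task.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",  # or "google/gemma-3-12b-it", "google/gemma-3-27b-it"
    device="cuda",
    torch_dtype=torch.bfloat16,
)

# A single user turn that combines an image (referenced by URL) with a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

# Generate up to 200 new tokens; the model's reply is the last message in the returned chat.
output = pipe(text=messages, max_new_tokens=200)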
print(output[0]["generated_text"][-1]["content"])</code></pre><div><hr></div><h3><strong>Gemma 1</strong></h3><p>If you would like to remind yourself about Gemma 1, here is a quick recap:</p><ul><li><p><strong>Model sizes</strong>: 2B and 7B parameters (both pre-trained &amp; instruction-tuned)</p></li><li><p><strong>Context length</strong>: 8192 tokens</p></li><li><p><strong>Modality</strong>: Text-only, English-only</p></li><li><p><strong>Highlights</strong>:</p><ul><li><p>Surpassed <strong>LLaMA 2 (7B &amp; 13B)</strong> and <strong>Mistral 7B</strong> in many language tasks</p></li></ul></li></ul><h4>Further resources</h4><p><a href="https://arxiv.org/abs/2403.08295">Gemma: Open Models Based on Gemini Research and Technology</a></p><div><hr></div><h3><strong>Gemma 2</strong></h3><p>If you would like to remind yourself about Gemma 2, here is a quick recap:</p><ul><li><p><strong>Model sizes</strong>: 2B &amp; 9B (via <strong>knowledge distillation</strong>), plus a <strong>27B model</strong> (trained from scratch)</p></li><li><p><strong>Context length</strong>: Still 8192 tokens</p></li><li><p><strong>Modality</strong>: Text-only, English-only</p></li><li><p><strong>Architecture Highlights</strong>:</p><ul><li><p>&#128313; <a href="#">Local + Global Attention Layers</a></p></li><li><p>&#128313; <a href="#">Grouped-Query Attention</a></p></li></ul></li><li><p><strong>Tokenizer</strong>: <a href="#">SentencePiece</a></p></li></ul><p>The 27B model in particular brought <strong>competitive performance</strong> with efficient scaling, while keeping the architecture relatively lightweight.</p><h4>Further resources</h4><p><a href="https://arxiv.org/abs/2408.00118">Gemma 2: Improving Open Language Models at a Practical Size</a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!7uYn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7uYn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7uYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png" width="1456" height="243" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121583,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.datascienceletter.com/i/160031640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7uYn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 424w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 848w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!7uYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b5b289-b14e-431e-8a7c-84e7ca590625_1800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Before you go, here are the takeaways:</p><ul><li><p>Gemma 3 is a major upgrade over Gemma 2</p></li><li><p>An open-source model that gives you full control over fine-tuning, 
alignment, inference, and deployment.</p></li><li><p>Competitive performance on benchmarks and human-preference evaluations, even against models with many more parameters such as LLaMA 3 and Mistral, and against closed-source models (GPT or Gemini)</p></li></ul><div><hr></div><h3>Resources</h3><p><a href="https://arxiv.org/abs/2403.08295">Gemma: Open Models Based on Gemini Research and Technology</a></p><p><a href="https://arxiv.org/abs/2408.00118">Gemma 2: Improving Open Language Models at a Practical Size</a></p><p><a href="https://arxiv.org/abs/2503.19786">Gemma 3 technical report</a></p><p><a href="https://huggingface.co/blog/gemma3">Hugging Face blog on Gemma 3</a></p>]]></content:encoded></item><item><title><![CDATA[Industry Applications of AI, ML & Data Science]]></title><description><![CDATA[Explore practical case studies showing how AI and data science are driving innovation across industries.]]></description><link>https://newsletter.datascienceletter.com/p/industry-application</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/industry-application</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Thu, 27 Mar 2025 22:42:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/129d42ce-6fe1-4671-9d98-7b982af211c5_2667x4000.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Artificial intelligence and data science are no longer confined to research labs &#8212; they are shaping how industries operate, make decisions, and deliver value.</p><p>This section of <em>Data Science Letter</em> highlights real-world use cases that reveal the power of applied machine learning (ML), artificial intelligence (AI), and data science (DS) across sectors.</p><p>Each post dives into a specific industry scenario &#8212; decoding the problem, the data, the algorithms used, and the business outcomes. 
Whether you're an aspiring data scientist or an industry professional, these articles are designed to bridge the gap between theory and practice.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qAeC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qAeC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qAeC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg" width="458" height="123.892578125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:1024,&quot;resizeWidth&quot;:458,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fichier:Netflix 2015 logo.svg &#8212; Wikip&#233;dia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fichier:Netflix 2015 logo.svg &#8212; Wikip&#233;dia" title="Fichier:Netflix 2015 logo.svg &#8212; Wikip&#233;dia" srcset="https://substackcdn.com/image/fetch/$s_!qAeC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!qAeC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ded903-05f3-4998-8d36-b6caa1c74138_1024x277.svg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><em><strong>Measuring Subscriber Value at Netflix: Causal Inference Meets Markov Chains</strong> <a 
href="https://newsletter.datascienceletter.com/p/beyond-customer-lifetime-valuation">link</a></em></p><blockquote><p>Topic: machine learning, causal inference, Markov chain</p></blockquote><p>Netflix doesn&#8217;t just acquire users &#8212; it quantifies the <em>long-term value</em> of each subscriber using advanced data science. This post explores how Netflix applies <strong>causal inference</strong> to isolate marketing impact and uses <strong>Markov chain modeling</strong> to predict retention behavior over time. Learn how these techniques guide smarter decisions in customer acquisition and lifecycle management.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kO36!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">&#128640; To receive new posts and support my work, consider becoming a free or paid subscriber</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[LightGBM vs XGBoost vs Catboost]]></title><description><![CDATA[Deep dive into similarities and differences of LightGBM, XGBoost and CatBoost in 10 minutes!]]></description><link>https://newsletter.datascienceletter.com/p/lightgbm-vs-xgboost-vs-catboost</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/lightgbm-vs-xgboost-vs-catboost</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Thu, 07 Nov 2024 21:43:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2La8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2La8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!2La8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 424w, https://substackcdn.com/image/fetch/$s_!2La8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 848w, https://substackcdn.com/image/fetch/$s_!2La8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 1272w, https://substackcdn.com/image/fetch/$s_!2La8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2La8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png" width="1400" height="933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2La8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 424w, https://substackcdn.com/image/fetch/$s_!2La8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 848w, https://substackcdn.com/image/fetch/$s_!2La8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 1272w, https://substackcdn.com/image/fetch/$s_!2La8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8badda5a-59ec-4741-82e3-9a785ef8b5f9_1400x933.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Quick summary</strong></h1><p>Hello &#128075; In this article, I will compare LightGBM, XGBoost and CatBoost in the following areas:</p><ul><li><p>Boosting algorithm</p></li><li><p>Node splitting</p></li><li><p>Missing data handling</p></li><li><p>Feature handling</p></li><li><p>Data sampling</p></li><li><p>LightGBM-specific features</p></li><li><p>XGBoost-specific features</p></li><li><p>CatBoost-specific features</p></li><li><p>Tips for choosing between LightGBM, XGBoost and CatBoost</p></li><li><p>Resources</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QRwP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QRwP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 424w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 848w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png" width="700" height="119" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:119,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QRwP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 424w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 848w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><blockquote><p><em>&#128640;  Email us @ <a href="http://newsletter.verticalsolution.io/">social@verticalsolution.io</a> if you want us to deep dive into a data science topic!</em></p></blockquote><h1><strong>Boosting algorithm</strong></h1><blockquote><p><em>Conventional boosting (LightGBM, XGBoost) vs Ordered boosting (CatBoost)</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y89F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y89F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 424w, https://substackcdn.com/image/fetch/$s_!y89F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 848w, https://substackcdn.com/image/fetch/$s_!y89F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 1272w, https://substackcdn.com/image/fetch/$s_!y89F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!y89F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png" width="1400" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!y89F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 424w, https://substackcdn.com/image/fetch/$s_!y89F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 848w, https://substackcdn.com/image/fetch/$s_!y89F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 1272w, https://substackcdn.com/image/fetch/$s_!y89F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1090fa5-09d9-4a61-a82d-57aaa0ec89f9_1400x794.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One of the major differences in tree building between LightGBM/XGBoost and CatBoost is CatBoost&#8217;s use of &#8216;ordered boosting&#8217;.</p><p>In conventional boosting algorithms (used by LightGBM and XGBoost), the tree at each boosting iteration is built using the same data points. It is argued that this repeated use of a single set of data points can increase the chance of overfitting.</p><p>To mitigate this effect, CatBoost supports a different boosting algorithm known as ordered boosting. The idea is to avoid using the same data points for both tree building and gradient or hessian computation. 
The method is briefly explained as follows:</p><ol><li><p>First, the original training dataset of size N is shuffled S times.</p></li><li><p>At each boosting iteration, for each shuffled dataset, a separate tree is built for each data position i (where i = 1, 2, &#8230;, N), using only the data points before position i (that is, points j with j &lt; i).</p></li><li><p>The gradients and hessians for a particular data point k are then computed using only the trees built from data points before k.</p></li></ol><p>In practice, training a tree for every data position in every shuffled dataset is infeasible, as the computational complexity would scale as SN&#178;. The actual algorithm builds trees for a fixed number of data positions K, reducing the complexity to SNK (usually K &lt;&lt; N; one example could be K = log N).</p><p>In addition, by default CatBoost only performs ordered boosting when the input dataset is small. For large datasets, CatBoost uses the regular boosting algorithm instead.</p><h1><strong>Tree growing</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NGXn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NGXn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NGXn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!NGXn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NGXn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NGXn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg" width="700" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NGXn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NGXn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!NGXn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NGXn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d612d89-319d-4fd5-8ca0-4a20009fe689_700x394.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Tree growing with LightGBM</strong></h2><blockquote><p><em>LightGBM: leaf-wise / lossguide</em></p></blockquote><p>In terms 
of tree growing methods, LightGBM uses leaf-wise tree growth (called <em>lossguide</em> in other implementations). A tree is built leaf by leaf until a maximum number of leaves is reached. At each splitting step, the leaf whose split gives the best loss improvement is split.</p><h2><strong>Tree growing with XGBoost</strong></h2><blockquote><p><em>XGBoost: depth-wise / lossguide</em></p></blockquote><p>XGBoost supports both leaf-wise and depth-wise tree growth, called <em>lossguide</em> and <em>depthwise</em> respectively. With depth-wise growth, a tree is built level by level until a specified depth is reached. Each splittable node from the previous tree level is split at the position that gives its own best loss improvement.</p><h2><strong>Tree growing with CatBoost</strong></h2><blockquote><p><em>CatBoost: symmetric tree, depth-wise / lossguide</em></p></blockquote><p>CatBoost supports three options: <em>SymmetricTree</em>, <em>depthwise</em> and <em>lossguide</em>. The <em>SymmetricTree</em> mode is slightly different from <em>depthwise</em>. 
While the mode <em>depthwise</em> splits each node at its own best loss improvement, SymmetricTree splits all nodes at the same depth using the same splitting criterion.</p><h1><strong>Node splitting</strong></h1><h2><strong>Node splitting with LightGBM and XGBoost</strong></h2><p>At each boosting iteration, the boosting algorithm grows a tree by finding, for each node, the split position that gives the best loss improvement.</p><p>For each candidate split, the difference in loss before and after splitting is computed:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eh5K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eh5K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 424w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 848w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 1272w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eh5K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png" width="646" height="47" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:47,&quot;width&quot;:646,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!eh5K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 424w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 848w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 1272w, https://substackcdn.com/image/fetch/$s_!eh5K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecffe3d2-a6ca-489a-8d9a-c284606daf0e_646x47.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The implementations differ in how the loss function is defined.</p><blockquote><p>LightGBM and 
XGBoost use primarily second-order approximations with regularisation to compute the splitting gain.</p></blockquote><p>In particular, for LightGBM, the splitting gain is defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xMHH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xMHH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 424w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 848w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 1272w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xMHH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png" width="700" height="108" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:108,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xMHH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 424w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 848w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 1272w, https://substackcdn.com/image/fetch/$s_!xMHH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31c51e2d-7bb0-48fd-b52e-196b350424d1_700x108.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For XGBoost, the splitting gain is defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a4ny!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a4ny!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 424w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 848w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 1272w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a4ny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png" width="700" height="118" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:118,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!a4ny!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 424w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 848w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 1272w, https://substackcdn.com/image/fetch/$s_!a4ny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ff1c38-8e57-4cab-a5bc-4c8df0017030_700x118.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Apart from the regularisation terms, the formulas for LightGBM and XGBoost look similar to one another.</p><h2><strong>Node splitting with CatBoost</strong></h2><blockquote><p>CatBoost supports four methods for computing the splitting gain.</p></blockquote><p>CatBoost, on the other hand, supports four different splitting gain criteria:</p><ul><li><p>L2, Cosine, NewtonL2 and NewtonCosine</p></li></ul><p>Note that the NewtonL2 option is similar to the splitting gain definition in LightGBM and XGBoost. 
Interested readers can visit this <a href="https://catboost.ai/en/docs/concepts/algorithm-score-functions#score-functions">page</a> for more information.</p><h1><strong>Missing data handling</strong></h1><h2><strong>Missing data with LightGBM and XGBoost</strong></h2><blockquote><p><em>During a node split, LightGBM and XGBoost assign missing data to whichever direction gives the best loss improvement.</em></p></blockquote><p>LightGBM and XGBoost handle missing data by assigning the affected data points to whichever direction gives the best loss improvement for the split. This allows the model to learn from the patterns of missing data.</p><p>This feature is quite useful when imputing or filtering on missing values is difficult, for example when many columns contain missing values.</p><p>There is one thing to be mindful of. If a column has no missing data during training but does at inference, LightGBM and XGBoost will assign those data points to their respective default directions. Default directions can vary case by case. Visit these two discussions (<a href="https://github.com/microsoft/LightGBM/issues/2921">here</a> on GitHub and <a href="https://www.kaggle.com/competitions/higgs-boson/discussion/8184#53396">here</a> on Kaggle) if you are interested. In any case, during model development it is worth investigating this train-test discrepancy, as it can lead to significantly degraded model performance from distribution drift.</p><h2><strong>Missing data with CatBoost</strong></h2><blockquote><p><em>CatBoost either assigns all missing data to the left, or the right, or forbids the presence of missing data.</em></p></blockquote><p>CatBoost does not provide adaptive missing data handling like LightGBM or XGBoost. 
For numerical features, the following modes are supported:</p><ul><li><p>Forbidden &#8212; missing data is treated as an error</p></li><li><p>Min &#8212; missing data for a feature is treated as the minimum value for that feature during a node split</p></li><li><p>Max &#8212; missing data for a feature is treated as the maximum value for that feature during a node split</p></li></ul><p>In general, more manual work is needed to handle missing data when using CatBoost.</p><h1><strong>Categorical feature handling</strong></h1><h2><strong>Categorical features handling with LightGBM</strong></h2><blockquote><p><em>LightGBM supports partitioning of categorical features.</em></p></blockquote><p>LightGBM handles a categorical feature by partitioning its categories into two subsets and choosing the partition that gives the best split, as shown in the figure below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J4Bv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J4Bv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J4Bv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 848w, https://substackcdn.com/image/fetch/$s_!J4Bv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!J4Bv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J4Bv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg" width="1400" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!J4Bv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J4Bv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 848w, https://substackcdn.com/image/fetch/$s_!J4Bv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!J4Bv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb1eaa-a5bb-49e6-b121-d3c6bbed0d1d_1400x788.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>One advantage is that a tree will not need to grow very deep to achieve similar accuracy.</p><h2><strong>Categorical features handling with XGBoost</strong></h2><blockquote><p><em>XGBoost supports partitioning of categorical features or one-hot encoding.</em></p></blockquote><p>Starting from version 1.5, XGBoost supports both one-hot encoding 
and partitioning.</p><h2><strong>Categorical features handling with CatBoost</strong></h2><blockquote><p><em>CatBoost transforms categorical features to numerical features by ordered target encoding.</em></p></blockquote><p>One of the major differences between CatBoost and other boosting algorithm implementations is the way CatBoost handles categorical features.</p><p><strong>Ordered target encoding</strong> &#8212; One major feature of CatBoost is its support for ordered target encoding. Normally, target encoding for categorical features can easily lead to data leakage, as training labels are directly used as features. To reduce this risk, CatBoost reuses the permutation trick from ordered boosting and computes the target-encoded value of each data point only from the data points that precede it.</p><p>For example, for binary classification problems, given a categorical feature with values A, B and C, one example of the target-encoded value for A is:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gTs8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gTs8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 424w, https://substackcdn.com/image/fetch/$s_!gTs8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 848w, 
https://substackcdn.com/image/fetch/$s_!gTs8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 1272w, https://substackcdn.com/image/fetch/$s_!gTs8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gTs8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg" width="1456" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gTs8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 424w, https://substackcdn.com/image/fetch/$s_!gTs8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 848w, 
https://substackcdn.com/image/fetch/$s_!gTs8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 1272w, https://substackcdn.com/image/fetch/$s_!gTs8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fe8eaca-2708-418e-bf8d-deb6d5cbc406_127x38.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where countInA is the sum of the binary targets over the data points with category A, and totalCount is the number of data points with category A. With the ordered scheme, both are computed only over the data points that precede the current one.</p><h1><strong>Data sampling</strong></h1><p>At each boosting iteration, it is possible to select only a fraction of the training data to grow the tree. There are many ways to sample the data. Examples include:</p><p>&#128308; <strong>Uniform sampling</strong> means that each data point is independently sampled with a certain probability (a configurable parameter). For example, uniform sampling with probability 0.3 means that roughly 30% of the training data is used at each iteration.</p><p>&#128308; <strong>Parametric sampling</strong> refers to sampling methods with sampling weights defined by a parametric function, such as exponential or Poisson functions.</p><p>&#128308; <strong>Gradient-based sampling</strong> refers to sampling methods with sampling weights defined by gradients. The rationale is that data points with small gradients are generally already well fitted and can therefore be downsampled. This speeds up training, as some data points are dropped.</p><p>LightGBM, XGBoost and CatBoost support different sets of sampling methods, which are explained below. Let&#8217;s start with LightGBM.</p><h2><strong>Data sampling method in LightGBM</strong></h2><blockquote><p><em>LightGBM supports uniform sampling and a gradient-based sampling method called GOSS.</em></p></blockquote><p>In LightGBM, the parameter to select sampling strategies is called <em>data_sample_strategy</em>. There are two options:</p><ul><li><p><em>bagging</em></p></li><li><p><em>goss</em></p></li></ul><p>Bagging refers to uniform sampling, with probability set by the parameter <em>bagging_fraction</em>.</p><p>The other option is goss, which stands for gradient-based one-side sampling. In this method, the top a% of data points with the largest absolute gradients are kept, and the remaining (100-a)% are downsampled. To avoid drifting from the original data distribution, the downsampled data points are up-weighted when the split gains are computed. 
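</p><p>The selection and re-weighting steps described above can be sketched in NumPy. This is a minimal illustration rather than LightGBM's actual implementation: the function name <code>goss_sample</code> is made up for this example, the arguments <code>top_rate</code> and <code>other_rate</code> mirror the LightGBM parameters of the same names, and the up-weighting factor (1 - top_rate) / other_rate follows the GOSS description in the LightGBM paper.</p><pre><code>import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the rows kept by a GOSS-style sampler."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    # Sort rows by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]        # large-gradient rows are always kept
    rest_idx = order[n_top:]
    # Uniformly sample from the remaining small-gradient rows
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)
    # Up-weight the sampled rows to compensate for the rows that were dropped
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(42).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[:3], w[-3:])</code></pre><p>With the default values used here, 20% of the rows are always kept and another 10% are drawn from the rest, each of the latter carrying a weight of (1 - 0.2) / 0.1 = 8.</p><p>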
This method ensures that we put more focus on the under-trained samples as the boosting iterations continue.</p><h2><strong>Data sampling method in XGBoost</strong></h2><blockquote><p><em>XGBoost supports uniform sampling and a gradient-based sampling method.</em></p></blockquote><p>In XGBoost, the parameter to select sampling strategies is called <em>sampling_method</em>. There are two options:</p><ul><li><p><em>uniform</em></p></li><li><p><em>gradient_based</em></p></li></ul><p>With the option <em>uniform</em>, one can set the probability with the parameter <em>subsample</em>.</p><p>The gradient-based sampling method supported by XGBoost samples data points according to the weight</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WFTO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WFTO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 424w, https://substackcdn.com/image/fetch/$s_!WFTO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 848w, https://substackcdn.com/image/fetch/$s_!WFTO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!WFTO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WFTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg" width="1456" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WFTO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 424w, https://substackcdn.com/image/fetch/$s_!WFTO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 848w, https://substackcdn.com/image/fetch/$s_!WFTO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!WFTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1cd844d-cf07-4d08-a64e-f4a3f1ed8268_168x67.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where g and h are the gradient and hessian respectively, and lambda is the regularisation parameter. 
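</p><p>To make the weight above concrete, here is a minimal NumPy sketch of gradient-based selection probabilities. It is an illustration only, not XGBoost's internal code: the function name <code>gradient_based_probabilities</code> and the argument <code>lam</code> are hypothetical, and the expression sqrt(g^2 + lambda * h^2) is the one given in the XGBoost documentation for the <em>gradient_based</em> option.</p><pre><code>import numpy as np

def gradient_based_probabilities(g, h, lam=1.0):
    """Selection probabilities proportional to sqrt(g^2 + lam * h^2)."""
    weights = np.sqrt(g ** 2 + lam * h ** 2)
    return weights / weights.sum()

# Toy example with logistic loss: g = p - y and h = p * (1 - p)
y = np.array([0.0, 0.0, 1.0, 1.0])
p = np.array([0.1, 0.6, 0.9, 0.3])   # current predicted probabilities
g = p - y
h = p * (1.0 - p)
probs = gradient_based_probabilities(g, h)
print(probs.round(3))</code></pre><p>The badly fitted data points (the second and the fourth) receive the largest selection probabilities, while the two well-fitted ones are rarely selected.</p><p>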
Note that this gradient-based sampling method is different from the one supported by LightGBM above.</p><h2><strong>Data sampling method in CatBoost</strong></h2><blockquote><p><em>CatBoost supports four types of sampling strategies: Bernoulli, Bayesian, Poisson and MVS.</em></p></blockquote><p>In CatBoost, the parameter to select sampling strategies is called <em>bootstrap_type</em>. There are four options:</p><ul><li><p>Bayesian</p></li><li><p>Bernoulli</p></li><li><p>MVS</p></li><li><p>Poisson</p></li></ul><p>Despite the difference in name, the Bernoulli sampling method is actually uniform sampling. One can control the probability of sampling with the parameter <em>subsample</em>.</p><p>The Bayesian sampling method samples data points with weights defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-yh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-yh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 424w, https://substackcdn.com/image/fetch/$s_!-yh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 848w, https://substackcdn.com/image/fetch/$s_!-yh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!-yh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-yh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg" width="1456" height="107" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e44878d8-fda7-415b-8264-0d8a06238048_597x44.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:107,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-yh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 424w, https://substackcdn.com/image/fetch/$s_!-yh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 848w, https://substackcdn.com/image/fetch/$s_!-yh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!-yh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44878d8-fda7-415b-8264-0d8a06238048_597x44.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where w is the weight for each data point. The parameter t is set by the <em>bagging_temperature</em> parameter. This is in fact the Bayesian Bootstrap method proposed by D. Rubin in 1981.</p><p>The Poisson sampling method samples data points with weights defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wT0C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wT0C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 424w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 848w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 1272w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wT0C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg" width="1456" height="127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wT0C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 424w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 848w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 1272w, https://substackcdn.com/image/fetch/$s_!wT0C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ea42fe-733b-4928-81a3-4c05ffa45a89_427x37.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Finally, the MVS (Minimum Variance Sampling) method is a gradient-based sampling method. 
MVS is in some ways similar to GOSS in LightGBM. The differences between this method and the one in LightGBM are:</p><ul><li><p>The threshold above which gradients are considered large is determined on the fly, while in LightGBM a fixed percentage is set before training.</p></li><li><p>Data points with small gradients are weighted with a constant factor in LightGBM, while with MVS they are sampled with a probability proportional to the gradient magnitude and re-weighted by the inverse of that probability.</p></li></ul><p>In theory, MVS usually gives a lower variance than GOSS. Interested readers can visit this <a href="https://arxiv.org/abs/1910.13204">paper</a> for more detail.</p><h1><strong>LightGBM-specific features</strong></h1><h2><strong>Piecewise linear regression</strong></h2><p>Instead of using aggregated statistics (e.g. mean or median) in each leaf for predictions, LightGBM can fit a linear regression within each leaf and use these linear models to generate predictions. This behaviour can be enabled with the argument <code>linear_tree</code>. The feature has been available since Dec 2020 (see the pull request <a href="https://github.com/microsoft/LightGBM/pull/3299">here</a>).</p><h2><strong>Exclusive feature bundling</strong></h2><p>This method reduces the number of sparse features during tree growing by regrouping mutually exclusive sparse features into bundles. The goal is to reduce the memory usage and improve the processing speed of the algorithm. Please refer to the original <a href="https://papers.nips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html">paper</a> for more detail.</p><h1><strong>XGBoost-specific features</strong></h1><h2><strong>Survival analysis</strong></h2><p>XGBoost provides loss functions designed for survival analysis. 
One key difference between ordinary regression and survival analysis is that target labels in survival analysis are often represented by a range instead of a point estimate, as the upper limit of the survival time can be infinite if the event has not occurred yet. The example below shows how to set up and use a survival-analysis-specific loss function in XGBoost.</p><pre><code>import numpy as np
import os
import pandas as pd
from sklearn.model_selection import ShuffleSplit

import xgboost as xgb

# Read data
df = pd.read_csv('data.csv')
y_lower_bound = df['y_lower_bound']
y_upper_bound = df['y_upper_bound']
X = df.drop(['y_lower_bound', 'y_upper_bound'], axis=1)

# Train test split
rs = ShuffleSplit(n_splits=2, test_size=.7, random_state=0)
train_index, valid_index = next(rs.split(X))

# Construct train dataset. XGBoost reads the interval labels from the
# fields 'label_lower_bound' and 'label_upper_bound'.
dtrain = xgb.DMatrix(X.values[train_index, :])
dtrain.set_float_info('label_lower_bound', y_lower_bound.iloc[train_index])
dtrain.set_float_info('label_upper_bound', y_upper_bound.iloc[train_index])

# Construct valid dataset
dvalid = xgb.DMatrix(X.values[valid_index, :])
dvalid.set_float_info('label_lower_bound', y_lower_bound.iloc[valid_index])
dvalid.set_float_info('label_upper_bound', y_upper_bound.iloc[valid_index])

# Perform training
params = {
  'verbosity': 0,
  'objective': 'survival:aft',
  'eval_metric': 'aft-nloglik',
  'aft_loss_distribution': 'normal',
  'aft_loss_distribution_scale': 1.20,
  'tree_method': 'hist',
  'learning_rate': 0.01,
  'max_depth': 6,
  'lambda': 0.01,
  'alpha': 0.02
}
bst = xgb.train(
  params, dtrain, 
  num_boost_round=10000,
  evals=[(dtrain, 'train'), (dvalid, 'valid')],
  early_stopping_rounds=50,
)</code></pre><h1><strong>How to choose between LightGBM, XGBoost and CatBoost</strong></h1><p>Here are a few aspects to consider when choosing between LightGBM, XGBoost, and CatBoost:</p><h2><strong>Memory storage and processing speed</strong></h2><p>LightGBM is generally the fastest of the three implementations. If memory storage or training speed is a concern (e.g. when you have a dataframe that already uses a large fraction of your memory), it is probably worth trying LightGBM first.</p><h2><strong>Missing data handling</strong></h2><p>LightGBM and XGBoost can send data points with missing values in the split direction that best optimises the objective function, while CatBoost does not support this feature. If your dataset has a lot of missing values and you would like to let the algorithm figure out how to treat them by itself, it is probably best to use LightGBM or XGBoost.</p><p>Both LightGBM and XGBoost apply this handling by default, so no preprocessing is needed before training. (Note that this method only works for missing values in feature columns. In general it does not make sense to have missing values in target columns.)</p><h2><strong>Overfitting</strong></h2><p>Regularisation parameters are more or less similar across the three implementations. Each implementation has a few common parameters that can be used to regularise the training and control overfitting. Here are a few examples:</p><ul><li><p><em>Number of leaves</em> &#10145;&#65039; Maximum number of leaves in each tree (The higher this number, the more complex the trees will become)</p></li><li><p><em>Maximum depth</em> &#10145;&#65039; Maximum depth of each tree (The higher this number, the more complex the trees will become)</p></li><li><p><em>Data sampling fraction</em> &#10145;&#65039; The fraction of training data to be used in each boosting iteration. 
This reduces the likelihood that a small portion of the training data dominates the training process and leads to overfitting</p></li><li><p><em>Data sampling frequency</em> &#10145;&#65039; How often the training data is resampled</p></li><li><p><em>Feature fraction</em> &#10145;&#65039; The fraction of features to be used in each boosting iteration</p></li><li><p><em>Early stopping</em> &#10145;&#65039; Stop training when the performance has not improved for a given number of iterations</p></li><li><p><em>Lambda L1 or L2</em> &#10145;&#65039; These regularisation parameters control the complexity of the trees (The higher these numbers, the less complex the trees are)</p></li><li><p><em>Dropout rate for DART</em> (LightGBM or XGBoost) &#10145;&#65039; During training, a fraction of the trained trees is dropped with a certain probability. This helps regularise the whole algorithm, as it removes over-dependency on a few trees.</p></li></ul><p>For relatively small datasets, where overfitting is often a problem, one can consider using ordered boosting in CatBoost.</p><h2><strong>Categorical feature handling</strong></h2><p>Both LightGBM and XGBoost support one-hot encoding or partitioning for categorical features during node splitting, while CatBoost relies on ordered target encoding.</p><p>Target encoding works better when target labels are strongly correlated with the encoded categories. The disadvantage is that this technique is in general more prone to overfitting.</p><h2><strong>Linear model</strong></h2><p>Both LightGBM and XGBoost support linear models for predictions, instead of simple aggregated statistics, such as the mean of the target labels, from the leaves. 
This gives more flexibility for model developments.</p><h1><strong>Resources</strong></h1><h2>Basics on gradient boosting</h2><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8fe3f973-d261-4450-ab7b-3a6bdef0ee86&quot;,&quot;caption&quot;:&quot;What is boosting?&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Concepts of boosting algorithms in machine learning&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-19T11:11:40.940Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba19bef9-a9be-4e7d-ad40-c8f87e614f2d_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/boosting-algorithms-in-machine-learning&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142752926,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical 
Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2><strong>LightGBM</strong></h2><p><a href="https://lightgbm.readthedocs.io/en/latest/index.html">Documentation</a> <a href="https://papers.nips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html">Paper</a></p><h2><strong>XGBoost</strong></h2><p><a href="https://xgboost.readthedocs.io/en/stable/index.html">Documentation</a> <a href="https://arxiv.org/abs/1603.02754">Paper</a></p><h2><strong>CatBoost</strong></h2><p><a href="https://catboost.ai/en/docs/">Documentation</a> <a href="https://catboost.ai/en/docs/concepts/educational-materials-papers">Papers</a></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QRwP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QRwP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 424w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 848w, 
https://substackcdn.com/image/fetch/$s_!QRwP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1272w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png" width="700" height="119" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:119,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QRwP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 424w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 848w, 
https://substackcdn.com/image/fetch/$s_!QRwP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1272w, https://substackcdn.com/image/fetch/$s_!QRwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a33c1a-a7d2-4217-a9b5-33e12904e8d7_700x119.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><em>&#128640;  Thank you for reading! Give us some feedback @ <a href="http://newsletter.verticalsolution.io/">social@verticalsolution.io</a></em></p></blockquote>]]></content:encoded></item><item><title><![CDATA[Practical guides on Catboost]]></title><description><![CDATA[What exactly is Catboost? How does it differ from LightGBM and XGBoost?]]></description><link>https://newsletter.datascienceletter.com/p/practical-guides-on-catboost</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/practical-guides-on-catboost</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Tue, 27 Aug 2024 10:37:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f5b076b1-597b-454f-944f-ca4463c819a0_2070x1380.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tiuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><em>&#128640; Subscribe to us for deep-dive content in DS, ML or AI! 
&#128640;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CDfY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CDfY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 424w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 848w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 1272w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CDfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png" width="559" height="235" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:235,&quot;width&quot;:559,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GitHub - catboost/catboost: A fast, scalable, high performance Gradient  Boosting on Decision Trees library, used for ranking, classification,  regression and other machine learning tasks for Python, R, Java, C++.  Supports computation on&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GitHub - catboost/catboost: A fast, scalable, high performance Gradient  Boosting on Decision Trees library, used for ranking, classification,  regression and other machine learning tasks for Python, R, Java, C++.  Supports computation on" title="GitHub - catboost/catboost: A fast, scalable, high performance Gradient  Boosting on Decision Trees library, used for ranking, classification,  regression and other machine learning tasks for Python, R, Java, C++.  
Supports computation on" srcset="https://substackcdn.com/image/fetch/$s_!CDfY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 424w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 848w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 1272w, https://substackcdn.com/image/fetch/$s_!CDfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a89b0b-23bc-4854-bad3-52fdcba14e63_559x235.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><h2>Table of contents</h2><ol><li><p>Quick summary</p></li><li><p>Concepts for the algorithm</p></li><li><p>General and Catboost-specific parameters</p></li><li><p>Practical considerations</p></li><li><p>More resources</p></li></ol><div><hr></div><h2>Quick summary</h2><p>Catboost is another powerful library for gradient tree boosting, similar to LightGBM and XGBoost.</p><p>Compared to other boosting implementations (LightGBM and XGBoost), its special features are:</p><ul><li><p>Ordered boosting (a slightly modified gradient boosting algorithm)</p></li><li><p>Advanced categorical variable handling</p></li><li><p>Feature combination</p></li><li><p>Symmetric trees</p></li></ul><p>We will elaborate on these concepts in the following sections.</p><div><hr></div><h2>Concepts for the algorithm</h2><p>The following assumes that you are already familiar with how gradient boosting works. 
Otherwise, for a brief review, please visit <a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a>.</p><h4>Ordered boosting</h4><blockquote><p>One source of overfitting &#8212; Repeated use of the same dataset for each boosting iteration can lead to overfitting and hence prediction shift.</p></blockquote><p>In many gradient boosting implementations (such as LightGBM and XGBoost), the whole training dataset is used to compute residuals, gradients or hessians based on the intermediate model from the previous boosting iterations. This repeated use of the same data points is one source of overfitting.</p><blockquote><p>Ordered boosting &#8212; a technique to build trees in each iteration by permuting training samples.</p></blockquote><p>To mitigate this problem, Catboost uses ordered boosting.</p><p>First, the original training dataset is randomly permuted S times. At each iteration, Catboost builds a list of new trees, one for each data position i, using only the data points before position i.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n4vE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n4vE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!n4vE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!n4vE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!n4vE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n4vE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:777652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n4vE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!n4vE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!n4vE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!n4vE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838bc2a5-44af-4987-a32f-a89fd54f8339_8000x4500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This conceptual implementation scales with O(SN^2), where N is the number of data points. This is not practical for large datasets. 
In practice, Catboost applies ordered boosting when the number of training data points is small. In addition, instead of maintaining a model for every data position, it maintains models only for positions 2^j, with j = 1, &#8230;, log2(N). This significantly reduces the complexity to O(SN). </p><div><hr></div><h4>Advanced categorical variable handling</h4><blockquote><p>Catboost's handling of categorical variables is more elaborate than that of its counterparts.</p></blockquote><p>Catboost supports several techniques to handle categorical variables, such as target encoding with ordering and feature combination.</p><p><strong>Target encoding with ordering</strong> &#8212; One way to handle categorical variables is target encoding. However, this technique often leads to overfitting, as the label values are used directly in feature preparation. Similar to ordered boosting in the previous section, Catboost addresses this problem by computing the target-encoded value (usually the average target value for a particular category) from only the data points prior to the position considered, as shown in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uoaG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uoaG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!uoaG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!uoaG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!uoaG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uoaG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:811661,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uoaG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!uoaG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!uoaG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!uoaG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe89f9bc9-bb03-44d5-917d-e9a906657b81_8000x4500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Feature combination</strong> &#8212; In addition, Catboost combines different categorical features. For example, if the dataset has two categorical features, pet (with values dog and cat) and color (with values white and black), Catboost would create new categories such as <em>white dog, white cat, black dog, black cat</em>. Note that Catboost performs this combination at each boosting iteration, adding combinations incrementally to avoid sparsity.</p><blockquote><p>Feature combination allows more complex combinations of categorical values while maintaining reasonable training speed.</p></blockquote><div><hr></div><h4>Symmetric trees</h4><p>Unlike XGBoost and LightGBM, Catboost can use symmetric trees as its base predictors. At each depth, Catboost ensures that all nodes use the same splitting rule. One advantage of this design is faster inference. 
It also further avoids overfitting by constructing trees with simpler structures.</p><blockquote><p>Symmetric trees &#8212; the splitting condition is the same at each depth</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wIkN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wIkN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 424w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 848w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 1272w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wIkN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png" width="728" height="580.8640613761688" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:3328,&quot;width&quot;:4171,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:490679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wIkN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 424w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 848w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 1272w, https://substackcdn.com/image/fetch/$s_!wIkN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c1873-2879-417a-9cf4-c4132b109577_4171x3328.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example of a symmetric tree</figcaption></figure></div><div><hr></div><h4>Advanced bootstrapping strategy</h4><p>Compared to LightGBM and XGBoost, Catboost supports a wider range of bootstrapping strategies during tree building, including:</p><ul><li><p>Bayesian, Bernoulli, MVS, Poisson</p></li></ul><p>One can also define the <em>sampling unit</em> and <em>sampling frequency</em>.</p><p>LightGBM and XGBoost, on the other hand, support:</p><ul><li><p>LightGBM: Bagging, GOSS (for a description of GOSS, see Practical guides on LightGBM)</p></li><li><p>XGBoost: Uniform, Gradient-based</p></li></ul><div><hr></div><h2>Parameters</h2><h4>General parameters (similar to LightGBM and XGBoost)</h4><p>Catboost supports various parameters to control and tune the boosting training process. 
</p><p>These parameters have close counterparts in other implementations (LightGBM, XGBoost), and their meanings are easy to follow given a basic understanding of gradient boosting (see <a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">our deep-dive into gradient boosting</a>).</p><p>The most common ones are as follows:</p><ul><li><p><strong>loss_function</strong> &#8212; the training objective for the problem.</p></li><li><p><strong>learning_rate</strong> &#8212; the factor by which the contribution of each new tree is scaled at every boosting iteration.</p></li><li><p><strong>iterations</strong> &#8212; the total number of boosting rounds to be performed.</p></li><li><p><strong>random_seed </strong>&#8212; the initial random seed used for training.</p></li><li><p><strong>l2_leaf_reg &#8212; </strong>the coefficient of the L2 regularization term applied during tree building.</p></li><li><p><strong>subsample &#8212; </strong>the bagging rate during tree building, used when the bootstrap type is Poisson, Bernoulli or MVS.</p></li><li><p><strong>colsample_bylevel &#8212; </strong>the fraction of features considered at each tree split.</p></li><li><p><strong>min_data_in_leaf &#8212; </strong>the minimum number of training samples required in each leaf (used only with the Depthwise and Lossguide grow policies).</p></li><li><p><strong>max_leaves &#8212; </strong>the maximum number of leaves in each tree (used only with the Lossguide grow policy).</p></li><li><p><strong>grow_policy &#8212; </strong>the tree-growing strategy, chosen from:</p><ul><li><p>SymmetricTree (all leaves at the previous depth are split with the same condition)</p></li><li><p>Depthwise (all non-terminal leaves are split, each with its own condition)</p></li><li><p>Lossguide (only the non-terminal leaf with the best loss improvement is split)</p></li></ul></li><li><p><strong>leaf_estimation_method &#8212; </strong>The method used to calculate the prediction from a leaf. 
The available options are:</p><ul><li><p>Newton</p></li><li><p>Gradient</p></li><li><p>Exact</p></li></ul></li></ul><div><hr></div><h4>Parameters specific to Catboost</h4><blockquote><p>bootstrap_type</p><p>Catboost supports a wide variety of sampling methods during tree building. These methods control the weight given to each training sample when splits are selected, which can partly reduce overfitting during training.</p></blockquote><p>Specifically, Catboost supports the following options:</p><ul><li><p><strong>Bayesian</strong> &#8212; the weight of each training sample is calculated by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w = (-\\log{\\phi})^t&quot;,&quot;id&quot;:&quot;SXXFUBHNCW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#966; is a random number sampled from the uniform distribution on [0,1]. The parameter t is the bagging temperature, which is set through <em>bagging_temperature</em> in Catboost.</p></li><li><p><strong>Bernoulli</strong> &#8212; this is the typical bagging method also used in LightGBM and XGBoost. The sampling rate is defined by the parameter <em>subsample</em>, as described in the previous section.</p></li><li><p><strong>MVS</strong> &#8212; Minimal Variance Sampling. Similar to GOSS in LightGBM, this is a bagging method which samples training examples based on their gradient values. 
For more detail about this algorithm, please see the paper <a href="https://arxiv.org/abs/1706.09516">here</a>.</p></li><li><p><strong>Poisson</strong> &#8212; the weight of each training sample is drawn from a Poisson distribution whose rate &#955; is given by:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lambda = -\\log{(1-\\mathrm{subsample})}&quot;,&quot;id&quot;:&quot;LMVPFLPYSQ&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>No</strong> &#8212; bootstrapping is disabled and every training sample keeps a weight of 1.</p></li></ul><p>Please note that one can also control how often the weights are resampled with the parameter <em>sampling_frequency</em>:</p><ul><li><p>PerTree &#8212; weights are determined once before each new tree is constructed</p></li><li><p>PerTreeLevel &#8212; weights are determined before each new split is chosen</p></li></ul><blockquote><p>nan_mode</p><p>This parameter selects the strategy used to handle missing data during training. While LightGBM handles missing data in a default way, Catboost lets us choose the strategy explicitly.</p></blockquote><p>In particular, the following options are provided:</p><ul><li><p><strong>Forbidden</strong> &#8212; missing values are not allowed, and an error is raised when one is encountered.</p></li><li><p><strong>Min</strong> &#8212; missing values are treated as smaller than all other values of the feature, so they go to the left side of a split.</p></li><li><p><strong>Max</strong> &#8212; missing values are treated as larger than all other values of the feature, so they go to the right side of a split.</p></li></ul><blockquote><p>boosting_type</p><p>This parameter selects the boosting algorithm.</p></blockquote><p>In particular, one can choose between:</p><ul><li><p><strong>Ordered</strong> &#8212; enables Ordered Boosting (as explained in the previous sections).</p></li><li><p><strong>Plain</strong> &#8212; enables the classic gradient boosting scheme.</p><div><hr></div></li></ul><h2><strong>Practical considerations</strong></h2><blockquote><p>Catboost is generally slower but tends to perform better on small datasets, and its handling of categorical variables is more 
elaborate.</p></blockquote><p>Compared to LightGBM and XGBoost, Catboost is generally slower. With small datasets, where processing speed is not the primary concern, using Catboost with Ordered Boosting enabled could give better performance.</p><p>In addition, Catboost can give significant advantages over its counterparts in situations in which the effects of feature interactions are significant. It can also be a good choice when the training data contains complex categorical variables.</p><div><hr></div><h2>Conclusion</h2><p>In many ways, Catboost is similar to LightGBM and XGBoost. It also differs from them, because it has additional built-in features that its counterparts do not have. This article highlights these differences and hopes to help you feel more comfortable when using this package. Stay tuned for our next deep-dive article!</p><div><hr></div><h2><strong>More resources</strong></h2><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;72073a08-3593-4a22-ada7-1198f0d6fac4&quot;,&quot;caption&quot;:&quot;What is boosting?&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Concepts of boosting algorithms in machine learning&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-19T11:11:40.940Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba19bef9-a9be-4e7d-ad40-c8f87e614f2d_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/boosting-algorithms-in-machine-learning&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142752926,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e87090c9-de2c-4ed0-8fc0-ed6d14ab317d&quot;,&quot;caption&quot;:&quot;To access more premium content, subscribe to paid version of the newsletter!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Practical guides on XGBoost&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-24T21:18:38.413Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2c01312-8199-442d-ae49-d22cd48ff5e8_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/practical-guides-on-xgboost&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142797756,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;401aa0d5-0eab-47c1-a3a5-ff1e3ead855f&quot;,&quot;caption&quot;:&quot;To access more premium content, subscribe to paid version of the newsletter!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Practical guides on LightGBM&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-26T15:59:37.656Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a07e47a0-d214-4b7f-a20f-01995e6dac31_1080x721.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/practical-guides-on-lightgbm&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142926222,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;64af0a63-1303-4f76-8f8d-15ba686c7e23&quot;,&quot;caption&quot;:&quot;&#128640; Subscribe to us to more deep-dive topics on data science, machine learning and artificial intelligence!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Practical guides of boosting in scikit-learn&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. 
We came from various background, such as particle physics, computational biology, computer science. We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-19T23:40:18.390Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/001ecc64-55c6-4b78-8e0e-0d2a06adc4be_1080x721.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/practical-guides-of-boosting-in-scikit&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143726664,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Paper from Catboost <a href="https://arxiv.org/abs/1706.09516">[1]</a> <a href="https://arxiv.org/abs/1706.09516">[2]</a></p><p><a href="https://catboost.ai/en/docs/concepts/educational-materials-videos">Videos</a> on Catboost</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tiuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture><div></div></div></a></figure></div><p><em>&#128640; Subscribe to us for deep-dive content in DS, ML or AI! 
&#128640;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Covariate adjustments in AB tests]]></title><description><![CDATA[The whats, whys and hows of covariate adjustments in AB tests!]]></description><link>https://newsletter.datascienceletter.com/p/covariate-adjustments-in-ab-tests</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/covariate-adjustments-in-ab-tests</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sat, 11 May 2024 00:07:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3befd29-45bb-4e55-84ba-a412c9948e03_1080x1620.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vkj4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vkj4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!Vkj4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vkj4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkj4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vkj4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vkj4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!Vkj4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vkj4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkj4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c7b3a9-24fe-4313-ab46-1b3901cda026_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><blockquote><p><em>&#128640; Subscribe to us @ <a href="http://newsletter.verticalsolution.io/">newsletter.verticalsolution.io</a> for more content in DS, ML or AI! &#128640;</em></p></blockquote>
      <p>
          <a href="https://newsletter.datascienceletter.com/p/covariate-adjustments-in-ab-tests">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Our list of deep-dive topics on causal inference and experimentation]]></title><description><![CDATA[Deep-dive series to causal inference methods and advanced experimentation topics, from covariate adjustment to causal ML, from AB testing to synthetic control.]]></description><link>https://newsletter.datascienceletter.com/p/causal-inference-and-experimentation</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/causal-inference-and-experimentation</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sat, 04 May 2024 18:01:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fe23b8a6-570e-4180-97b9-3fd29b3b369b_1080x721.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bm6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Bm6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13d03-9113-457f-9587-95b19f3d1ba4_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><blockquote><p><em>&#128640; Subscribe to us @ <a href="http://newsletter.verticalsolution.io/">newsletter.verticalsolution.io</a> for more content in DS, ML or AI! &#128640;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h1>Causal inference</h1><pre><code>This section covers various topics on causal inference and observational studies in data science.</code></pre><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;782a5c1c-d7e3-4526-b851-9e207807649f&quot;,&quot;caption&quot;:&quot;&#128640; Subscribe to us to more deep-dive topics on data science, machine learning and artificial intelligence!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Netflix measures the value of subscriber acquisition and retention using causal inference and Markov chain&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-19T11:17:19.857Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9462a69-4f80-4d38-a4e7-7f33d055d0c2_10000x5625.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/beyond-customer-lifetime-valuation&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142753041,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h1>Experimentation</h1><pre><code>This section covers various topics on experimentation and AB testing in data science.</code></pre><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a2dc65a6-972c-4162-ba7d-cc53dfb21de4&quot;,&quot;caption&quot;:&quot;In this series, we will go over the following topics from causal inference methods: Randomised control trials (RCT) [covered in this article] Covariate adjustment in RCT Regression analysis Propensity score matching Propensity score weighting Double-robust estimators&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Randomised control 
trials&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-04T17:58:16.367Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7722c47a-d2ba-48de-bd0c-9bdf875670cc_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/randomised-control-trials&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144095020,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a259fa1b-8913-48b2-9dcc-64bf841dbf53&quot;,&quot;caption&quot;:&quot;&#128640; Subscribe to us @ newsletter.verticalsolution.io for more content in DS, ML or AI! 
&#128640;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Covariate adjustments in AB tests&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-11T00:07:10.981Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57897f80-37e7-4916-8c12-16d0cc676c91_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/covariate-adjustments-in-ab-tests&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144313045,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><blockquote><p><em>&#128640; Subscribe to us @ <a href="http://newsletter.verticalsolution.io/">newsletter.verticalsolution.io</a> for more content in DS, ML or AI! 
&#128640;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>]]></content:encoded></item><item><title><![CDATA[Randomised control trials]]></title><description><![CDATA[Why is RCT important? What are common types of RCTs in data science?]]></description><link>https://newsletter.datascienceletter.com/p/randomised-control-trials</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/randomised-control-trials</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sat, 04 May 2024 17:58:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7722c47a-d2ba-48de-bd0c-9bdf875670cc_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this series, we will go over the following topics from causal inference methods:</p><ul><li><p>Randomised control trials (RCT) <strong>[covered in this article]</strong></p></li><li><p>Covariate adjustment in RCT</p></li><li><p>Regression analysis</p></li><li><p>Propensity score matching</p></li><li><p>Propensity score weighting</p></li><li><p>Double-robust estimators</p></li><li><p>Subgroup analysis</p></li><li><p>Double machine learning</p></li><li><p>Instrumental variable</p></li><li><p>Difference-in-difference</p></li><li><p>Regressio&#8230;</p></li></ul>
      <p>
          <a href="https://newsletter.datascienceletter.com/p/randomised-control-trials">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Practical guides of boosting in scikit-learn]]></title><description><![CDATA[A deep-dive article on boosting algorithms in scikit-learn]]></description><link>https://newsletter.datascienceletter.com/p/practical-guides-of-boosting-in-scikit</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/practical-guides-of-boosting-in-scikit</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Fri, 19 Apr 2024 23:40:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/001ecc64-55c6-4b78-8e0e-0d2a06adc4be_1080x721.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>&#128640; Subscribe to us for more deep-dive topics on data science, machine learning and artificial intelligence!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>This article highlights boosting algorithms in scikit-learn. For a conceptual introduction to gradient boosting, please visit our article <a href="https://newsletter.verticalsolution.io/p/boosting-algorithms-in-machine-learning">here</a>.</p><h2>Introduction</h2><p>In essence, scikit-learn offers two options for gradient-boosted trees:</p><ol><li><p>Gradient boosting</p></li><li><p>Histogram-based gradient boosting</p></li></ol><p>For gradient boosting, one can use the following classes from <code>sklearn.ensemble</code>:</p><pre><code># Gradient boosting
GradientBoostingClassifier
GradientBoostingRegressor</code></pre><p>while for histogram-based gradient boosting, one can use:</p><pre><code><code># Histogram-based gradient boosting
HistGradientBoostingClassifier
HistGradientBoostingRegressor</code></code></pre><p>Each method is explained in detail in its own section below.</p><p>It is useful to note the pros and cons of each option.</p><p>In terms of speed, histogram-based gradient boosting is generally faster than gradient boosting on large datasets. Gradient boosting evaluates candidate splits over the individual data points of every feature, so the cost of splitting a node grows with both the number of data points and the number of features, whereas the histogram-based version only scans a fixed number of bins per feature when searching for a split.</p><p>In terms of accuracy, node splitting is coarser in histogram-based gradient boosting because it works on binned values. Gradient boosting considers finer split points, so it can in principle fit a more complex model.</p><p>The histogram-based implementation in scikit-learn supports missing values and categorical data natively, which makes upstream data preprocessing a bit easier.</p><div><hr></div><h2>Gradient boosting</h2><p>The classes <code>GradientBoostingClassifier</code> and <code>GradientBoostingRegressor</code> implemented in scikit-learn use the classic gradient boosting algorithm described in our article <a href="https://newsletter.verticalsolution.io/i/142752926/gradient-boosting">here</a>.</p><p>In particular, at each boosting round, a regression tree (<code>DecisionTreeRegressor</code>) is fitted using the negative gradients of the loss function as target labels (the gradients are computed at the predictions from the previous iterations). The objective of the regression tree is either mean squared error (MSE) or MSE with Friedman's improvement score. 
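</p><p>A minimal sketch of the point above, assuming scikit-learn and NumPy are installed. The synthetic data and the variable names are ours and purely illustrative. It fits a small model and inspects the regression trees stored for each boosting round:</p><pre><code>import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=20, random_state=0)
model.fit(X, y)

# One regression tree is stored per boosting round.
first_tree = model.estimators_[0, 0]
print(type(first_tree).__name__)   # DecisionTreeRegressor
print(model.estimators_.shape)     # (20, 1)
print(first_tree.criterion)        # split criterion used inside each tree</code></pre><p>The last line prints the criterion that the fitted trees were built with.</p><p>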
The regression tree's objective is also used to measure the quality of each split.</p><p>It is important to note that other implementations such as LightGBM or XGBoost use different tree-fitting procedures.</p><p>These two classes are relatively slow when the training dataset is large, as each node split is determined by looking at all possible splits for all features.</p><div><hr></div><h2>Histogram-based gradient boosting</h2><p>Inspired by LightGBM, scikit-learn also implements histogram-based gradient boosting.</p><p>Before tree fitting, the feature values of the training samples are binned into integer-valued bins (by default <code>max_bins=255</code>, plus one extra bin reserved for missing values, i.e. 256 bins in total). For each node split, histograms built on these bins are then used instead of considering all possible split points for each feature. This usually speeds up training, as there is no need to sort the dataset for each feature at each boosting round.</p><p>Like LightGBM, these classes also support missing values. In particular, during training, samples with missing values are assigned to whichever child node (left or right) maximises the potential gain of the split. During inference, samples with missing values are sent to the same child node that was learned during training. If no missing values were encountered for a feature during training, samples with missing values for that feature are assigned to the child node with more samples.</p><div><hr></div><h2>Summary</h2><p>In conclusion, this article explored boosting algorithms within the scikit-learn library.</p><p>We reviewed the foundations of boosting, a technique for creating powerful ensembles by sequentially training weak learners. 
We then delved deeper into gradient boosting, a popular boosting method that builds an additive model by fitting decision trees on the negative gradients of a loss function.</p><p>Finally, we introduced histogram-based gradient boosting, a faster variant particularly suitable for larger datasets.</p><p>By understanding these boosting techniques in scikit-learn, you can leverage their ensemble power to enhance the accuracy and robustness of your machine learning models!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Vertical Solution is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!55lC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!55lC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 424w, 
https://substackcdn.com/image/fetch/$s_!55lC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!55lC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!55lC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!55lC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!55lC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 424w, 
https://substackcdn.com/image/fetch/$s_!55lC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!55lC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!55lC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3199af1-add6-416f-8aea-50a90031b9fc_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Practical guides on bootstrapping]]></title><description><![CDATA[How to use bootstrapping effectively in data science domains? What are the common considerations?]]></description><link>https://newsletter.datascienceletter.com/p/practical-guides-on-bootstrapping</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/practical-guides-on-bootstrapping</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Fri, 05 Apr 2024 20:58:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/99a12c05-609b-4bc7-a45f-fa8995d3b570_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kJbB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kJbB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kJbB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!kJbB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kJbB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc57217e-517a-41c9-afdb-cec197b4c637_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Consider subscribing to our paid version with a price of one cup of cappuccino per month if you find our content useful!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h4>What is bootstrapping?</h4><p>The idea of bootstrapping is that statistical inference about a population from sample data can be approximated in two steps:</p><ol><li><p>resampling the sample data</p></li><li><p>statistical inference from the resampled data</p></li></ol><p>In the following sections, we will discuss various resampling and statistical inference methods used in bootstrapping.</p><div><hr></div><h4>Practical considerations when using 
bootstrapping</h4><p>Statistical inference involves drawing statistical conclusions from a finite data sample and extrapolating these conclusions to the whole population.</p><p>However, in reality, even drawing statistical conclusions from a finite data sample can be tricky in the following ways:</p><ol><li><p>We do not know a priori the correct probability density function (pdf) for our chosen test statistic.</p></li><li><p>Even if we know the correct pdf for our test statistic, it may not have a nice analytical form.</p></li></ol><p>Bootstrapping is particularly advantageous in scenarios where either an analytical expression for the sampling distribution is unavailable or the applicability of asymptotic theory (e.g., the central limit theorem) is uncertain.</p><p>If we have a large data sample, its empirical distribution can approximate the true underlying pdf.</p><p>It is also worth noting that the quality of inference from the resampled data can be assessed, because the original sample plays the role of the &#8216;true&#8217; population for the resamples and we can compare the inference results against it. This provides a proxy for the quality of the inference we would like to make about the true population.</p><div><hr></div><h4>Resampling methods &#8212; Case bootstrapping</h4><p>Case bootstrapping is probably the simplest resampling method. It involves drawing samples <strong>with replacement</strong> from the original data set:</p><blockquote><p>Assuming the original data set has n observations:</p><ol><li><p>k resampled data sets (also called bootstrap samples), each with n observations, are formed by sampling with replacement, where k is often between 50 and 1000.</p></li><li><p>The test statistic of interest is calculated for each resampled data set.</p></li></ol></blockquote><p>In the end, a distribution of the test statistic of interest (often called the bootstrap distribution) with size k is generated. 
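</p><p>The two steps above can be sketched with the Python standard library alone. This is only an illustration under our own assumptions: the function name <code>case_bootstrap</code>, the synthetic data and the choice of the median as test statistic are hypothetical.</p><pre><code>import random
import statistics

def case_bootstrap(data, statistic, k, rng):
    """Return k values of the statistic, one per resample drawn with replacement."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(k)]

rng = random.Random(42)
data = [rng.expovariate(1.0) for _ in range(100)]  # synthetic, skewed sample

# Bootstrap distribution of the median, with k = 1000 resamples of size n = 100.
boot_medians = case_bootstrap(data, statistics.median, k=1000, rng=rng)

print(len(boot_medians))               # number of bootstrap replicates
print(statistics.median(data))         # statistic on the original sample
print(statistics.stdev(boot_medians))  # bootstrap estimate of its standard error</code></pre><p>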
With one of the statistical inference methods explained below, this distribution can be used to perform hypothesis testing or confidence interval estimation.</p><h4>Resampling methods &#8212; Bayesian bootstrapping</h4><p>This method was first proposed in this <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-9/issue-1/The-Bayesian-Bootstrap/10.1214/aos/1176345338.full">paper</a> in 1981.</p><p>Instead of actually resampling the original data set, Bayesian bootstrapping creates bootstrap samples by reweighting the original data set:</p><blockquote><p>Assuming the original data set has n observations:</p><ol><li><p>Generate n-1 uniform random numbers in the interval [0,1], sort them, and add 0 at the start and 1 at the end. Let&#8217;s call the resulting list w, with entries w(0), w(1), &#8230;, w(n).</p></li><li><p>Reweight each data point, indexed by i = 0, &#8230;, n-1, by the following formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w(i+1) - w(i)&quot;,&quot;id&quot;:&quot;IHMTNMETBT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where w(0) = 0 and w(n) = 1, so the n weights are non-negative and sum to 1.</p></li><li><p>The test statistic of interest is calculated for this reweighted data set.</p></li><li><p>Repeat this process k times, where k is often between 50 and 1000 (similar to what we have in case bootstrapping).</p></li></ol></blockquote><h4>Resampling methods &#8212; Poisson bootstrapping</h4><p>This method is explained in, for example, this <a href="https://research.google/pubs/estimating-uncertainty-for-massive-data-streams/">paper</a> from Google in 2012.</p><p>This method is based on the observation that case bootstrapping is equivalent to assigning an integer weight to each data point (the number of times it appears in the resample), where the weight vector is drawn from a multinomial distribution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w \\sim \\mathrm{Multinom}_N(1/N,...,1/N)&quot;,&quot;id&quot;:&quot;ICHCKJCMNA&quot;}" 
data-component-name="LatexBlockToDOM"></div><p>Here N is the number of data points, so each individual weight marginally follows Binomial(N, 1/N). In the limit when N goes to infinity, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lim_{N\\to\\infty} \\mathrm{Binomial}(N,\\frac{1}{N}) = \\mathrm{Poisson}(1)&quot;,&quot;id&quot;:&quot;YHMMKRZBUH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means that for a large dataset, bootstrapping with weights drawn iid from Poisson(1) gives results similar to case bootstrapping (sampling with replacement).</p><p>The advantage of Poisson bootstrapping is that the total number of data points does not need to be known in advance. Hence this method can be easily parallelised in a distributed computing environment such as Spark.</p><div><hr></div><h4>Statistical inference &#8212; Percentile method</h4><p>The basic method is very simple. After obtaining the bootstrap distribution of the test statistic, one can simply compute its empirical quantiles and use them as the confidence interval:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;[\\hat{\\theta}_{\\alpha/2},\\hat{\\theta}_{1-\\alpha/2}]&quot;,&quot;id&quot;:&quot;IAXKBBIKBN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#945; is the significance level, i.e. the interval has confidence level 1-&#945;. 
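</p><p>As a rough sketch using only the Python standard library, the following combines the Poisson(1) weights from the previous section with the percentile interval above. The function names, the synthetic data, the weighted mean as test statistic and the simple index-based quantile rule are our own illustrative choices, not a reference implementation.</p><pre><code>import math
import random

def poisson_one(rng):
    """Draw one Poisson(1) variate with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def poisson_bootstrap_means(data, k, rng):
    """Bootstrap distribution of the mean using iid Poisson(1) weights."""
    means = []
    while len(means) != k:
        weights = [poisson_one(rng) for _ in data]
        total = sum(weights)
        if total == 0:  # every point received weight 0, so redraw
            continue
        means.append(sum(w * x for w, x in zip(weights, data)) / total)
    return means

def percentile_interval(boot_stats, alpha):
    """Empirical alpha/2 and 1 - alpha/2 quantiles of the bootstrap distribution."""
    ordered = sorted(boot_stats)
    last = len(ordered) - 1
    lower_idx = math.floor((alpha / 2) * last)
    upper_idx = math.ceil((1 - alpha / 2) * last)
    return ordered[lower_idx], ordered[upper_idx]

rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]  # synthetic sample
boot = poisson_bootstrap_means(data, k=500, rng=rng)
lower, upper = percentile_interval(boot, alpha=0.05)
print(lower, upper)  # 95% percentile interval for the mean</code></pre><p>Because each weight is drawn independently, the weights for different chunks of the data could be generated on different workers without knowing the total sample size.</p><p>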
In addition, there is another method which uses the reverse percentile for the confidence interval estimation, i.e.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;[2\\hat{\\theta}-\\hat{\\theta}_{1-\\alpha/2},2\\hat{\\theta} -\\hat{\\theta}_{\\alpha/2}]&quot;,&quot;id&quot;:&quot;LQMNRDAGTF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here the estimate without a quantile subscript is the test statistic computed on the original sample. It is worth noting that the percentile method is recommended when the bootstrap distribution does not have a long tail and is mostly symmetric.</p><h4>Statistical inference &#8212; Studentised method</h4><p>In general, the studentised method is more accurate, but comes with a larger computational cost, as it involves two levels of bootstrapping. This is how it works:</p><ol><li><p>Construct a set of B bootstrap samples:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x^{(b)}_{1},x^{(b)}_{2},...,x^{(b)}_{n}, b = 1,2,...,B&quot;,&quot;id&quot;:&quot;QSOPMVQUKQ&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Calculate the test statistic for each bootstrap sample, as in the other methods.</p></li><li><p>For each bootstrap sample, i.e. 
b = 1, 2, 3, &#8230; , B,</p><ol><li><p>Construct another set of M bootstrap samples by resampling from bootstrap sample b:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x^{(b,m)}_{1},x^{(b,m)}_{2},...,x^{(b,m)}_{n}, m = 1,2,...,M&quot;,&quot;id&quot;:&quot;ZGLKVFAQUI&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Calculate the test statistic for each new bootstrap sample.</p></li><li><p>Calculate the standard deviation of these M test statistics; this is the standard error estimate for bootstrap sample b.</p></li></ol></li><li><p>Calculate the t-statistic for each bootstrap sample:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;t^{(b)} = \\frac{\\theta^{(b)} - \\hat{\\theta}}{s^{(b)}}&quot;,&quot;id&quot;:&quot;NPBMFXRFAB&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Compute the &#945;/2 and 1-&#945;/2 quantiles of the bootstrapped t-statistic distribution.</p></li><li><p>The confidence interval can be calculated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;[\\hat{\\theta}-s\\,q_{1-\\alpha/2},\\hat{\\theta}-s\\,q_{\\alpha/2}]&quot;,&quot;id&quot;:&quot;XKKLJEWYXD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where s is the standard error of the estimate on the original sample (for example, the standard deviation of the B bootstrap test statistics) and q denotes the quantiles from the previous step. The upper quantile enters the lower bound, as in the reverse percentile interval.</p></li></ol><div><hr></div><h4>Key takeaways</h4><p>Advantages of bootstrapping:</p><ul><li><p>It is quite general and applicable to a wide range of statistical inference problems</p></li><li><p>It is conceptually simple and can be used to estimate statistical quantities, such as standard errors and confidence intervals, even for very complex test statistics</p></li><li><p>It can also incorporate different sampling methods easily</p></li></ul><p>Disadvantages of bootstrapping:</p><ul><li><p>It can be computationally intensive</p></li><li><p>It is subject to the quality of the original data sample</p></li><li><p>As an empirical method, its assumptions and validity are harder to examine</p></li></ul><div><hr></div><h4>Summary</h4><p>This article summarises a few common methods in bootstrapping, a statistical 
method for estimating sampling distributions. We explored various resampling methods, such as case bootstrapping, Bayesian bootstrapping and Poisson bootstrapping. We have also talked about a few methods to perform confidence interval estimation given a set of bootstrap samples.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IkBv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IkBv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IkBv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png" width="1128" height="191" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IkBv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!IkBv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0af44ae4-77cf-4754-bd26-08b3298e60ca_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe 
to the paid version of the newsletter to enjoy more premium content!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Concepts of multi-armed bandits]]></title><description><![CDATA[An essential introduction to key concepts of multi-armed bandits, epsilon-greedy, UCB, Bayesian UCB, Thompson sampling]]></description><link>https://newsletter.datascienceletter.com/p/concepts-of-multi-armed-bandits</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/concepts-of-multi-armed-bandits</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Mon, 01 Apr 2024 22:19:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d7bd4b33-2766-4406-bfbf-9bcab9c3f175_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0LWV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0LWV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!0LWV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 
848w, https://substackcdn.com/image/fetch/$s_!0LWV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!0LWV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0LWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0LWV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!0LWV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!0LWV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!0LWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa6e08b-526f-4e3f-aa30-03c53d2dfdc5_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you find our articles useful for your machine learning journey, consider subscribing to our paid version for the price of one cup of cappuccino per month.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h4>Why multi-armed bandit?</h4><p>Compared to typical machine learning models, (non-contextual) multi-armed bandits, also known as MAB, have several practical advantages:</p><ol><li><p>Light model serving: Because the model is simple, MAB typically does not require a separate model call to obtain predictions. In addition, MAB does not require features, so no resources are needed to build a feature engineering pipeline or to design a suitable model architecture. This makes a MAB model much easier to serve.</p></li><li><p>Easy maintenance: Because it depends little on other parts of the pipeline (such as a feature store or a model prediction call), a MAB solution is typically much easier to maintain.</p></li><li><p>Handling data drift: Customer behaviours change over time. 
MAB is typically used as an online learning algorithm, which makes it a natural way to handle data drift.</p></li></ol><div><hr></div><h4>Stochastic Bandit</h4><p>Let&#8217;s start with one of the simplest forms of bandits.</p><blockquote><p>Problem statement</p></blockquote><p>Given a fixed number of rounds T and K arms:</p><p>For each round:</p><ol><li><p>A policy (or a model, or an algorithm) selects one arm, hence performing an action.</p></li><li><p>The reward for this action is sampled from the reward distribution of the selected arm.</p></li><li><p>The policy uses this reward to update itself.</p></li></ol><p>The process repeats T times.</p><blockquote><p>Assumptions</p></blockquote><p>In addition to the formulation, there are two major assumptions in this simplest form of MAB:</p><ol><li><p>At each round, we can only observe the reward of the chosen arm.</p></li><li><p>The reward distributions of the arms are independent of one another and do not change over time. When an arm is chosen, the observed reward is sampled from the reward distribution of that arm, so the rewards of a given arm are iid across rounds.</p></li></ol><p>Care has to be taken with Assumption 2. For data science problems in which data drift is an issue, it may be necessary to:</p><ul><li><p>Use other MAB algorithms that can approximately account for drift in the data or in the reward distributions, for example contextual bandits with time-dependent or seasonal features. For an introduction to contextual bandits, please see our articles on the topic.</p></li><li><p>Maintain a certain amount of exploration, for example with the epsilon-greedy algorithm. 
Concepts of exploitation and exploration are described in the next section.</p></li></ul><h4>Exploitation and exploration</h4><p>In multi-armed bandits (and in reinforcement learning more generally), exploration and exploitation are two opposing strategies.</p><p>Given a bandit problem, exploitation means choosing the arm that looks best under current knowledge of the system (which may be incomplete or misleading), while exploration means trying out arms that currently look suboptimal in order to collect extra information about the system.</p><p>Finding the balance between exploitation and exploration is one of the most crucial points in multi-armed bandit problems, and various algorithms have been designed to tackle it. The following sections introduce a few of them.</p><h4>Epsilon-greedy algorithm</h4><blockquote><p>Definition</p></blockquote><p>Given a MAB problem with T rounds and K arms, and a given value of epsilon between 0 and 1, the epsilon-greedy algorithm is as follows:</p><div><hr></div><p>For each round <em>t = 1, 2, 3, &#8230;, T</em>, </p><ol><li><p>Draw a random real number <em>r</em> uniformly from the interval [0,1].</p></li><li><p>Choose an arm:</p><ol><li><p>If <em>r</em> is less than epsilon, choose an arm uniformly at random.</p></li><li><p>Otherwise, choose the arm with the highest average reward so far.</p></li></ol></li><li><p>Observe the reward of the chosen arm and update the average reward of that arm accordingly.</p></li></ol><div><hr></div><blockquote><p>Key takeaway</p></blockquote><ol><li><p><strong>Data drift</strong>: With a constant epsilon, there is always a certain degree of exploration. Under data drift, the data collected from this constant exploration can help to detect the drift and mitigate some of its bias. 
However, when there is no data drift, this constant exploration can be suboptimal.</p></li><li><p><strong>Non-adaptive exploration</strong>: The exploration schedule of the algorithm does not depend on the history of observed rewards. Epsilon can be adjusted according to the number of rounds played, the number of arms or other parameters, but such a schedule is still fixed in advance. Other MAB algorithms, such as UCB (explained below), adapt their exploration to the observed rewards.</p></li></ol><h4>UCB algorithm</h4><blockquote><p>Upper confidence bound (UCB)</p></blockquote><p>Before introducing the UCB algorithm, it is necessary to define the upper confidence bound (UCB).</p><p>The UCB for a given arm <em>a</em> at a given round <em>t</em> is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathrm{UCB}_{a,t} = \\hat{\\mu}_{a,t} + \\sqrt{\\frac{2\\mathrm{log}(t)}{n_{a,t}}}&quot;,&quot;id&quot;:&quot;RPCWAXEHUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#956; (shown with a hat in the formula) is the average of all observed rewards for arm <em>a</em> up to round <em>t</em>, and <em>n</em> is the number of times that arm <em>a</em> has been selected up to round <em>t</em>.</p><p>The second term in the expression is the half-width of a confidence interval for the expected reward of arm <em>a</em> at round <em>t</em>. 
The derivation of this confidence interval is based on <em>Hoeffding&#8217;s inequality</em>, which is explained in the Appendix.</p><p>We are now in a position to define the UCB algorithm.</p><blockquote><p>Definition</p></blockquote><p>Given a MAB problem with T rounds and K arms, the UCB algorithm is as follows:</p><div><hr></div><p>For each round <em>t = 1, 2, 3, &#8230;, T</em>, </p><ol><li><p>Pick the arm with the highest upper confidence bound (UCB). An arm that has never been selected has no defined bound, so each arm is selected once first.</p></li><li><p>Observe the reward and update the average reward and the selection count of the chosen arm.</p></li></ol><div><hr></div><p>The intuition behind the UCB algorithm is that there are two scenarios in which we would like to select an arm and observe its reward:</p><ol><li><p>The average reward of this arm is higher than that of the other arms. (Exploitation)</p></li><li><p>We have not selected this arm enough times to be confident about its average reward. (Exploration)</p></li></ol><p>The first case gives a high value in the first summand of the UCB formula, while the second case gives a high value in the second summand. Therefore, picking the arm with the highest UCB is a natural way to balance exploitation and exploration.</p><blockquote><p>Key takeaway</p></blockquote><ul><li><p>The confidence bound in the UCB algorithm is derived without assuming any particular form for the reward distributions, other than bounded rewards as required by Hoeffding&#8217;s inequality. Hence the confidence bound tends to be conservative.</p></li><li><p>When more is known a priori about the reward distributions (for example, that they follow a known family such as the normal distribution), we can replace the UCB bound with confidence bounds derived from that distribution instead.</p></li><li><p>The UCB algorithm is one of the simplest algorithms that take into account the uncertainty of the average reward estimate of each arm. In this algorithm we use UCB as a heuristic to balance exploitation and exploration. 
There are other UCB variants that expand further on this idea.</p></li></ul><h4>Bayesian UCB algorithm</h4><p>Bayesian UCB was first proposed in this <a href="https://proceedings.mlr.press/v22/kaufmann12.html">paper</a>.</p><p>Unlike UCB, which estimates confidence bounds with a specific formula based on <em>the Hoeffding bound</em>, Bayesian UCB provides a more general mathematical framework that can include custom priors and performs Bayesian updates on model parameters at each round. </p><p>For a more general introduction to Bayesian statistics, please visit <a href="https://newsletter.verticalsolution.io/p/bayesian-statistics">this article</a> in our data science section.</p><blockquote><p>Definition</p></blockquote><p>For each round <em>t = 1, 2, 3, &#8230;, T</em>, </p><ol><li><p>Compute the 1-1/t quantile for each arm as below, where Q is the quantile function, its first argument is the quantile level, and &#955; is the posterior distribution of the mean reward of arm j after round t-1.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_{j}(t) = Q(1-\\frac{1}{t},\\lambda^{t-1}_j)&quot;,&quot;id&quot;:&quot;AMCAKMXNCW&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Choose the arm with the highest quantile value.</p></li><li><p>Observe the reward and perform a Bayesian update (posterior estimation) on &#955; for the chosen arm.</p></li></ol><blockquote><p>Key takeaway</p></blockquote><p>The main advantage of this algorithm is that it is as simple a policy as UCB, yet generic enough to incorporate various Bayesian statistical models. 
This algorithm also has a proper Bayesian interpretation.</p><p>In addition, one can choose a reward distribution from the one-parameter exponential family (such as the normal, Bernoulli or gamma distribution) together with its conjugate prior, which simplifies the estimation of posterior distributions.</p><div><hr></div><h4>Thompson sampling</h4><p>Thompson sampling is one of the most commonly used bandit algorithms. It was proposed in this <a href="https://www.jstor.org/stable/2332286">paper</a> by Thompson in 1933.</p><p>Like Bayesian UCB, it is a Bayesian bandit algorithm. Let&#8217;s look at what this algorithm does.</p><blockquote><p>Definition</p></blockquote><p>For each round <em>t = 1, 2, 3, &#8230;, T</em>, </p><ol><li><p>For each arm, draw a sample of the mean reward from its posterior distribution.</p></li><li><p>Choose the arm with the highest sampled mean reward.</p></li><li><p>Observe the resulting reward from the chosen arm.</p></li><li><p>Perform a Bayesian update with this extra data point (chosen arm, received reward).</p></li></ol><blockquote><p>Key takeaways</p></blockquote><p>Thompson sampling and Bayesian UCB are both Bayesian bandits, so they share some properties, but they also differ in a few ways.</p><h5><strong>Similarities</strong></h5><ol><li><p>At each round, after receiving a new data point (chosen arm, received reward), the posterior distribution is updated and used as the prior for the next round. </p></li><li><p>Both algorithms can explore because the posterior distributions carry uncertainty. The more uncertain the posterior distribution of an arm is, the more exploration of that arm is induced in both algorithms.</p></li><li><p>Both algorithms require initial prior distributions to be specified. A careful choice of prior can improve performance and convergence rate. 
</p></li></ol><h5><strong>Differences</strong></h5><ol><li><p>Bayesian UCB is optimistic by design: it scores each arm by a posterior quantile whose level, 1-1/t, rises towards 1 as t grows. Thompson sampling has no such schedule.</p></li><li><p>Bayesian UCB uses a summary statistic of the posterior distribution, i.e. its (1-1/t) quantile, and is therefore deterministic given the posterior, while Thompson sampling uses a single random draw from the posterior distribution.</p></li></ol><div><hr></div><h4>Summary</h4><p>In this article we have covered the basic concepts of stochastic multi-armed bandits and a few commonly used MAB algorithms, including the epsilon-greedy algorithm, UCB, Bayesian UCB and Thompson sampling. We have discussed the basic idea of each algorithm and some practical considerations when using them. </p><p>However, there is a lot more to cover. Please stay tuned for our in-depth guides on each algorithm (and more)!</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0uPG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0uPG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!0uPG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!0uPG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0uPG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0uPG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0uPG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!0uPG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!0uPG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0uPG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a5ed2e-8f46-4ac3-a957-655d39278c23_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to the paid version of the newsletter to enjoy more premium content!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Selection bias]]></title><description><![CDATA[What is Selection bias? 
How does it affect interpretations of AB tests?]]></description><link>https://newsletter.datascienceletter.com/p/selection-bias</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/selection-bias</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Thu, 28 Mar 2024 00:15:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3612eb45-54df-4b77-b25c-83af0e8383db_1080x715.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VGYY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VGYY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VGYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VGYY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!VGYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8e6f0b-33b0-4256-a07a-533030993e7f_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p>If you find that our articles are useful for your data science journey, consider 
subscribing to our paid version for the price of one cup of cappuccino per month.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Before going into the details of sampling and selection bias, we need to clarify the concepts of population and sample in statistics. Readers who are familiar with these concepts can skip the next section.</p><h2>Population and sample</h2><p>Put simply, a population is the whole set of statistical units (such as all people living in the US when conducting a US income survey) about which we would like to draw statistical conclusions. </p><p>In reality, however, it is usually impossible to survey or gather data for every statistical unit. Therefore, we often sample a finite number of statistical units and perform statistical inference on this finite group to infer information about the whole population. This finite group is a sample. </p><p>Since we are effectively extrapolating conclusions drawn from a sample to a population, a few statistical concepts must be respected, and sampling bias and selection bias are among them. The ability to extrapolate in this way is called external validity in statistics.</p><h2>Random sampling</h2><p>Random sampling is a selection process in which every element of the target population has the same probability of being included in the sample at each draw. The resulting collection is called a simple random sample. </p><p>This process can be implemented with replacement, where observations return to the population pool after each selection, allowing for potential re-selection in subsequent draws. 
Alternatively, random sampling can be conducted without replacement, where chosen observations are no longer available for future selections.</p><p>Random sampling is the simplest form of sampling. It requires a complete sampling frame (i.e. access to all statistical units), which may or may not be available. A sample obtained this way is also representative of the underlying population.</p><h2>Selection bias</h2><p>Selection bias occurs when statistical units are selected (or sampled) from a population in such a way that randomisation is not properly achieved, so that the obtained sample is not representative of the population.</p><p>There are many types of selection bias in statistics. This article explains a few of them below.</p><h4>Sampling bias</h4><p>Sampling bias occurs when certain statistical units are consistently more likely to be selected than other units. Sampling bias is one of the most common causes of selection bias.</p><h4>Early stopping</h4><p>Selection bias can arise when an AB test is terminated early, at a time when its results support the desired conclusion.</p><h4>Data handling</h4><p>Selection bias can also arise from filtering a sample obtained from a randomised trial based on certain data properties, such as rejecting bad data.</p><h4>Cherry-picking</h4><p>Selection bias can also occur when a subset of a sample is specifically selected to support a &#8216;conclusion&#8217;, or when a set of studies is specifically selected when conducting a meta-analysis.</p><h2>Summary</h2><p>While in some cases it is trivial to avoid selection bias, in other cases it is impossible to avoid it. It is important to understand that such bias exists and how it can affect conclusions drawn even from properly designed experiments. 
</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iexQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iexQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iexQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png" width="1128" height="191" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iexQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!iexQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56216f3-a362-481b-8d48-771d5c0d989b_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe 
to the paid version of the newsletter to enjoy more premium content!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Practical guides on LightGBM]]></title><description><![CDATA[Deep dive into practicalities of LightGBM]]></description><link>https://newsletter.datascienceletter.com/p/practical-guides-on-lightgbm</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/practical-guides-on-lightgbm</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Tue, 26 Mar 2024 15:59:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a07e47a0-d214-4b7f-a20f-01995e6dac31_1080x721.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><h2>About this post</h2><p>This post shares practical knowledge for using LightGBM.</p><p>For a more conceptual introduction to boosting, please see the post &#8216;<a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a>&#8217;.</p><p>For an introduction to XGBoost, please see the post &#8216;<a href="https://verticalsolution.substack.com/p/practical-guides-on-xgboost">Practical guides on XGBoost</a>&#8217;.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!qKwW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qKwW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 424w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 848w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1272w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg" width="1456" height="330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Welcome to LightGBM's documentation! 
&#8212; LightGBM 4.0.0 documentation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Welcome to LightGBM's documentation! &#8212; LightGBM 4.0.0 documentation" title="Welcome to LightGBM's documentation! &#8212; LightGBM 4.0.0 documentation" srcset="https://substackcdn.com/image/fetch/$s_!qKwW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 424w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 848w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1272w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><div><hr></div><h2>Introduction</h2><p>Like <em>XGBoost</em>, <em>LightGBM</em> is a powerful implementation of tree-based boosting algorithms.</p><p>This article explores the core features of <em>LightGBM</em>, focusing on gradient-based sampling and exclusive feature bundling, as well as practical considerations when using this open-source package. 
By reading this article, you can understand the essence of <em>LightGBM</em> and how it differs from other implementations.</p><p>The official website for <em>LightGBM</em> can be found <a href="https://lightgbm.readthedocs.io/en/stable/">here</a>. Its corresponding paper can be found <a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf">here</a>.</p><div><hr></div><h2>Core features</h2><h4>Newton tree boosting</h4><blockquote><p>LightGBM uses Newton tree boosting!</p></blockquote><p>Like <em>XGBoost</em>, <em>LightGBM</em> uses Newton descent at each boosting iteration, although the LightGBM paper does not mention it. From the implementation on <a href="https://github.com/microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp">GitHub</a> (search for the static double function <code>GetLeafGain</code>), it can be seen that the leaf gain in a node split is evaluated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{k} = -\\frac{1}{2} \\sum_{i=1}^{N_{\\mathrm{node}}} \\frac{\\mathrm{max}\\left(\\left|\\sum_{j\\in\\mathrm{node~i}} g_j\\right|-\\lambda_{\\mathrm{L1}},0\\right)^2}{\\sum_{j\\in\\mathrm{node~i}} h_j+\\lambda_{\\mathrm{L2}}}&quot;,&quot;id&quot;:&quot;LYBYXUIIPU&quot;}" data-component-name="LatexBlockToDOM"></div><p>which, apart from the regularisation terms, is the same as the formula implemented in <em>XGBoost</em>. Here g_j and h_j are the gradient and hessian of the loss for data point j, and the two regularisation strengths correspond to the arguments <code>lambda_l1</code> and <code>lambda_l2</code>.</p><p>Please refer to the article &#8216;<a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a>&#8217; for an introduction to gradient tree boosting and Newton boosting.</p><h4>Gradient-based one-side sampling (GOSS)</h4><blockquote><p>GOSS speeds up training by sampling data based on gradient values!</p></blockquote><p>Specifically, GOSS first sorts the data points by the absolute value of their 
gradients. The top a &#215; 100% of points are kept as part of the training data. In addition, GOSS randomly samples b &#215; 100% of instances from the rest of the data (the data points with smaller gradients). To compensate for the change in data distribution, GOSS up-weights the sampled small-gradient points by a constant factor (1&#8722;a) / b when calculating the information gain. For example, with a = 0.2 and b = 0.1, each sampled small-gradient point is weighted by (1 &#8722; 0.2) / 0.1 = 8.</p><p>This reduces computation time by cutting the number of data points used during tree growing, without changing the original data distribution by much.</p><h4><strong>Node splitting for categorical features</strong> </h4><blockquote><p>Node splitting on categorical variables is often handled better natively in LightGBM than with typical methods like one-hot encoding!</p></blockquote><p>One common way to encode categorical features is one-hot encoding. This approach is often suboptimal for tree learning algorithms, especially for categorical features with high cardinality.</p><p>LightGBM instead handles categorical features natively. At node splitting, a categorical feature is split by dividing its categories into two subsets. For a feature with k categories there are 2^(k&#8722;1) &#8722; 1 possible partitions. Rather than evaluating every one of them, LightGBM sorts the categories by their accumulated gradient statistics (sum of gradients divided by sum of hessians) and searches for the best split along that ordering.</p><p>This is in many cases a better approach than one-hot encoding: each split can separate a whole group of categories at once, whereas with one-hot encoding a split can only isolate a single category.</p><h4>Exclusive feature bundling</h4><p>This method reduces the number of features considered during tree growing by grouping mutually exclusive sparse features into bundles. 
Features are mutually exclusive when they rarely take non-zero values at the same time; each group of such features, merged into a single feature, is called an exclusive feature bundle.</p><div><hr></div><h2>Interesting features</h2><h4>Piecewise linear regression at each leaf</h4><blockquote><p>Instead of predicting a constant value in each leaf, LightGBM can fit a linear regression within each leaf and use these linear models to generate predictions. This feature is enabled with the argument <code>linear_tree</code>.</p></blockquote><p>Since December 2020 (pull request <a href="https://github.com/microsoft/LightGBM/pull/3299">here</a>), it has been possible to use piecewise linear regression in <em>LightGBM</em>. This allows practitioners to use a linear regression at each leaf for predictions (instead of aggregated statistics of the target labels).</p><p>The implementation is slightly different from the paper &#8216;Gradient boosting with piece-wise linear regression trees&#8217; (<a href="https://arxiv.org/abs/1802.05640">link</a>). In <em>LightGBM</em>, node splits are determined in the same way as without linear models. Only after the tree structure has been determined is a linear regression model fitted in each leaf. In the original paper, evaluating a split also includes fitting a linear regression model, which makes the procedure much more computationally intensive.</p><div><hr></div><h2>Tackling overfitting with LightGBM parameters</h2><p>A common problem with gradient boosting is overfitting. 
With LightGBM, one can reduce the variance of the model with the following parameters:</p><ul><li><p>Decrease <code>max_bin</code></p></li><li><p>Decrease <code>num_leaves</code></p></li><li><p>Decrease <code>max_depth</code></p></li><li><p>Increase <code>lambda_l1</code>, <code>lambda_l2</code></p></li><li><p>Increase <code>min_data_in_leaf</code> </p></li><li><p>Increase <code>min_sum_hessian_in_leaf</code></p></li><li><p>Enable row bagging with <code>bagging_fraction</code> and <code>bagging_freq</code>, or feature subsampling with <code>feature_fraction</code></p></li></ul><div><hr></div><h2>Hyperparameters</h2><p>While there are many hyperparameters to tune in <em>LightGBM</em>, this section explains the most important ones, which come up often in data science and machine learning problems.</p><blockquote><p><code>boosting</code></p></blockquote><p>Similar to the argument <code>booster</code> in <em>XGBoost</em>, this argument allows you to choose which boosting algorithm to use: <code>gbdt</code>, <code>rf</code> or <code>dart</code>.</p><blockquote><p><code>learning_rate</code></p></blockquote><p>This argument corresponds to the learning rate or shrinkage rate. Please refer to our post <a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a> if needed.</p><blockquote><p><code>num_iterations</code></p></blockquote><p>This argument controls the total number of boosting iterations to be carried out.</p><blockquote><p><code>max_depth</code></p></blockquote><p>This argument refers to the maximum depth that a weak tree can have at each boosting iteration. The higher this value is, the more complex each weak tree can become, and the more likely the final model is to overfit.</p><blockquote><p><code>bagging_fraction</code></p></blockquote><p>This argument refers to the fraction of rows randomly sampled for weak tree training. It only takes effect when <code>bagging_freq</code> is positive, in which case the rows are resampled every <code>bagging_freq</code> iterations. 
This sampling can help reduce overfitting.</p><blockquote><p><code>linear_tree</code></p></blockquote><p>This argument controls whether a linear regression model is fitted in each leaf. Please refer to the section <em>Interesting features</em> above for an introduction to the method.</p><blockquote><p><code>lambda_l1, lambda_l2</code></p></blockquote><p>These two arguments are the L1 and L2 regularisation strengths that appear in the leaf gain formula. Larger values shrink the leaf outputs and make the model more conservative.</p><blockquote><p><code>drop_rate</code></p></blockquote><p>This argument is only relevant when <code>boosting</code> is <code>dart</code>. It sets the fraction of previously built trees that are dropped during a boosting iteration. </p><div><hr></div><h2>Summary</h2><p>In this article we covered how LightGBM splits nodes, two methods that speed up training (GOSS and exclusive feature bundling), an interesting feature called piecewise linear regression, and a list of important parameters to pay attention to during training. We will continue writing more deep-dive content on this topic, so stay tuned!</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!rucD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rucD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!rucD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!rucD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 1272w, https://substackcdn.com/image/fetch/$s_!rucD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rucD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png" width="1100" height="220" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148835,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rucD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!rucD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!rucD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 1272w, https://substackcdn.com/image/fetch/$s_!rucD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e31e2dc-e6c6-4593-9713-374d84723046_1100x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Practical guides on XGBoost]]></title><description><![CDATA[Deep dive into practicalities of XGBoost]]></description><link>https://newsletter.datascienceletter.com/p/practical-guides-on-xgboost</link><guid 
isPermaLink="false">https://newsletter.datascienceletter.com/p/practical-guides-on-xgboost</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sun, 24 Mar 2024 21:18:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b2c01312-8199-442d-ae49-d22cd48ff5e8_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" width="1128" height="191" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">To 
access more premium content, subscribe to the paid version of the newsletter!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>About this post</h2><p>This post shares practical knowledge for using XGBoost. For a more conceptual introduction to what XGBoost does under the hood, please see the post &#8216;<a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a>&#8217;.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NXkI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NXkI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 424w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 848w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png" width="408" height="157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NXkI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 424w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 848w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><div><hr></div><h2>Introduction</h2><p>Extreme Gradient Boosting, <em>XGBoost,</em> is a powerful implementation of tree-based boosting algorithms, available in Python. This note explores the core features of XGBoost, focusing on regularised learning objectives, gradient tree boosting, the learning rate parameter, row and column subsampling techniques, and node splitting algorithms. By reading this note, you can understand the essence of <em>XGBoost</em> and how it differs from other implementations.</p><p>The official website for <em>XGBoost</em> can be found <a href="https://xgboost.readthedocs.io/">here</a>. Its corresponding paper can be found <a href="https://arxiv.org/abs/1603.02754">here</a>.</p><div><hr></div><h2>Core features</h2><p>While there are a number of key improvements in XGBoost, in this post we focus on the two most important ones: gradient tree boosting with regularisation and approximate node splitting algorithms.</p><h4>Gradient tree boosting with regularisation</h4><p>As in gradient tree boosting, at each boosting iteration a weak tree is fitted and then added to the final predictor.</p><p>Specifically, when growing a weak tree, the optimal split at each node is the one that maximises the loss reduction:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta L_{\\mathrm{split}} = L_{\\mathrm{no~split}} - \\left(L_{\\mathrm{left}} + L_{\\mathrm{right}}\\right)&quot;,&quot;id&quot;:&quot;ZNIOORXYYX&quot;}" data-component-name="LatexBlockToDOM"></div><p>where </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{k} = -\\frac{1}{2} \\sum_{i=1}^{N_{\\mathrm{node}}} \\frac{\\left(\\sum_{j\\in\\mathrm{node~i}} g_j\\right)^2}{\\sum_{j\\in\\mathrm{node~i}} h_j+\\lambda} + \\gamma N_{\\mathrm{node}}&quot;,&quot;id&quot;:&quot;XLRQZXUQBH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where k represents left, right or no-split, N_node is the number of leaf nodes, </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_j,~h_j&quot;,&quot;id&quot;:&quot;QFABNXECJX&quot;}" data-component-name="LatexBlockToDOM"></div><p>represent the gradient and hessian for data point j respectively, and</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lambda,~\\gamma&quot;,&quot;id&quot;:&quot;FOJWPABCUE&quot;}" data-component-name="LatexBlockToDOM"></div><p>are the regularisation hyperparameters that control the complexity of the weak tree.</p><p>This formula is derived by minimising a second-order Taylor approximation of the regularised loss function. You can find the exact derivation in the original paper.</p><h4>Node splitting</h4><p>To find the best split for a given node during tree fitting, a split finding algorithm is needed to determine how to divide the data points in that node between the left and right children. </p><p>XGBoost supports the exact, approximate and histogram-based splitting algorithms.</p><p><em><strong>Exact algorithm</strong></em></p><p>This algorithm enumerates all features, and for each feature it iterates over all values to find the split that maximises the loss reduction (defined in the previous section). This algorithm is very computationally intensive. To make the search efficient, XGBoost first sorts the data points by feature value and then accumulates the gradient and hessian statistics in sorted order.</p><p><em><strong>Approximate algorithm</strong></em></p><p>Instead of considering all possible splits from all features, the approximate algorithm evaluates only a subset of candidate splits for each feature. </p><p>Candidate splits can be proposed either before or during tree fitting. 
The former, also known as global splitting, constructs candidate splits once, before any node is split. The latter, also known as local splitting, re-proposes candidate splits at each node split. XGBoost supports both methods.</p><p><em><strong>Histogram-based algorithm</strong></em></p><p>This candidate splitting method is similar to the one used by LightGBM. Refer to the practical guide on LightGBM in the same series to learn more about this algorithm.</p><h4>Missing data</h4><p>When the training data contains missing values, XGBoost assigns them at each split to the side (left or right) that maximises the loss reduction. </p><div><hr></div><h2>Hyperparameters</h2><p>While there are many hyperparameters to tune in XGBoost, this section explains the most important ones, which come up often in data science and machine learning problems.</p><p><code>booster</code></p><p>This argument allows you to choose which boosting algorithm to use: <code>gbtree</code>, <code>dart</code> or <code>gblinear</code>. Both <code>gbtree</code> and <code>dart</code> use trees as weak learners, while <code>gblinear</code> uses linear models. The difference between <code>gbtree</code> and <code>dart</code> is that <code>dart</code> applies random dropout during training, which randomly masks weak learners with a certain probability, to further reduce overfitting.</p><p><code>eta</code></p><p>This argument corresponds to the learning rate or shrinkage rate. 
Please refer to our post <a href="https://verticalsolution.substack.com/p/boosting-algorithms-in-machine-learning">Concepts of boosting algorithms in machine learning</a> if needed.</p><p><code>gamma, lambda</code></p><p>These two arguments are the regularisation hyperparameters that control the size of the weak tree, as explained in the previous section of this post.</p><p><code>max_depth</code></p><p>This argument is the maximum depth that a weak tree can have at each boosting iteration. The higher this value, the more complex each weak tree can become, and the more likely the final model is to overfit.</p><p><code>max_leaves</code></p><p>This argument is the maximum number of leaves that a weak tree can have at each boosting iteration.</p><p><code>subsample</code></p><p>This argument is the fraction of rows randomly sampled to train the weak tree at each boosting iteration. This sampling can help reduce overfitting.</p><p><code>colsample_bytree, colsample_bylevel or colsample_bynode</code></p><p>These three arguments control the sampling of columns for weak tree training. The sampling can happen once for each tree (bytree), for each tree depth (bylevel) or for each node (bynode).</p><p><code>sampling_method</code></p><p>This argument specifies how training data points are sampled. XGBoost supports:</p><ul><li><p>uniform sampling &#8212; each data point has an equal probability of being selected</p></li><li><p>gradient-based sampling &#8212; each data point is selected with a probability proportional to the regularised absolute value of its gradient</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sqrt{g^2 + \\lambda h^2}&quot;,&quot;id&quot;:&quot;LQFZTYWVVW&quot;}" data-component-name="LatexBlockToDOM"></div><p><code>tree_method</code></p><p>This argument specifies the node splitting algorithm to be used. 
Node splitting algorithms supported by XGBoost are explained in the previous section <em>Node splitting</em>.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to the paid version of the newsletter to enjoy more premium content! </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Jqt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Jqt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!1Jqt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!1Jqt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1Jqt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Jqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png" width="1100" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148835,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Jqt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 424w, https://substackcdn.com/image/fetch/$s_!1Jqt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 848w, https://substackcdn.com/image/fetch/$s_!1Jqt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1Jqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9debad-12eb-4a45-bf69-f48139708c40_1100x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Our list of deep-dive topics on data science]]></title><description><![CDATA[A list of articles to deep dive into basic and advanced data science topics, from data visualisation to exploratory data analysis, to machine learning models.]]></description><link>https://newsletter.datascienceletter.com/p/complete-guide-to-data-science</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/complete-guide-to-data-science</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sun, 24 Mar 2024 12:03:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f87ed7f0-33dd-466c-964b-0a27a69eef45_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kO36!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, 
https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>&#128640; Subscribe to us for more deep dive content on data science, machine learning and artificial intelligence!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h1>Statistics</h1><pre><code><code>This section talks about essential statistics concepts in data science, focusing on how these concepts relate to real-life scenarios.</code></code></pre><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f98e9ddc-30ab-46d6-a7da-85f9c8b6d906&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Selection bias&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-28T00:15:18.362Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3612eb45-54df-4b77-b25c-83af0e8383db_1080x715.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/selection-bias&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143023235,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c365c597-423c-44a8-8558-c796254f3cb1&quot;,&quot;caption&quot;:&quot;Consider subscribing to our paid version with a price of one cup of cappuccino per month if you find our content useful!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Practical guides on bootstrapping&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-05T20:58:59.494Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99a12c05-609b-4bc7-a45f-fa8995d3b570_1080x720.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/practical-guides-on-bootstrapping&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143198053,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h1>Causal inference</h1><pre><code><code>This section talks about essential concepts in causal inference, focusing on how these concepts relate to real-life scenarios.</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4C0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!4C0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg" width="248" height="67.10989010989012" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:248,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!4C0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;129f5f20-d6e2-4542-9428-f2545917e05b&quot;,&quot;caption&quot;:&quot;&#128640; Subscribe to us to more deep-dive topics on data science, machine learning and artificial intelligence!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Netflix measures the value of subscriber acquisition and retention using causal inference and Markov chain&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:217027086,&quot;name&quot;:&quot;Vertical Solution&quot;,&quot;bio&quot;:&quot;We are a group of data scientists in the tech sector, passionate about solving problems with data. We came from various background, such as particle physics, computational biology, computer science. 
We are also passionate about writing data science.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cb522d0-4d41-41ac-81a6-7bce18164a16_300x300.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-19T11:17:19.857Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9462a69-4f80-4d38-a4e7-7f33d055d0c2_10000x5625.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.verticalsolution.io/p/beyond-customer-lifetime-valuation&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142753041,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Vertical Solution&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7ae4f2-591a-4737-a685-8db7e0968e99_300x300.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h1>Machine learning</h1><p>For more ML-related content, please visit our article list on machine learning <a href="https://newsletter.verticalsolution.io/p/complete-guide-to-machine-learning">here</a>.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kO36!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!kO36!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!kO36!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!kO36!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f7dc9bc-5cf6-4181-a413-3abdb0c2531b_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to the paid version of the newsletter to enjoy more premium content!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deep-Dive Topics in Machine Learning]]></title><description><![CDATA[Explore both foundational and advanced machine learning topics &#8212; from linear regression to boosting methods to 
online learning models like multi-armed bandits.]]></description><link>https://newsletter.datascienceletter.com/p/complete-guide-to-machine-learning</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/complete-guide-to-machine-learning</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Sun, 24 Mar 2024 11:44:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aef9f167-f31d-4f55-96f0-c2db727a4270_5376x3604.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tiuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you find that our articles are useful for your machine learning journey, consider subscribing 
to our paid version with a price of one cup of cappuccino per month.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>&#9889; Large Language Models (LLMs)</h2><p>Large Language Models are transforming the landscape of AI, powering everything from chatbots and code assistants to document summarization and creative generation. In this section, we explore the <strong>theory</strong>, <strong>engineering</strong>, and <strong>practical application</strong> of LLMs&#8212;along with guides to prominent open-source and proprietary models.</p><ul><li><p>&#128736;&#65039; <a href="https://newsletter.datascienceletter.com/p/gemma-3-is-here-googles-open-source">Gemma 3 Is Here &#8212; Google's Open-Source LLMs Just Got a Big Upgrade</a></p></li></ul><div><hr></div><h3>&#9889; Boosting Algorithms</h3><p>Boosting is a powerful ensembling technique that improves model performance by combining weak learners. 
In this section, we cover both the <strong>theory</strong> and <strong>practice</strong> of boosting, including hands-on guides to popular libraries:</p><ul><li><p>&#128216; <a href="https://newsletter.datascienceletter.com/p/boosting-algorithms-in-machine-learning">Concepts in Boosting Algorithms in Machine Learning</a></p></li><li><p>&#128736;&#65039; <a href="https://newsletter.datascienceletter.com/p/practical-guides-on-lightgbm">Practical Guide to LightGBM</a></p></li><li><p>&#128736;&#65039; <a href="https://newsletter.datascienceletter.com/p/practical-guides-on-xgboost">Practical Guide to XGBoost</a></p></li><li><p>&#128736;&#65039; <a href="https://newsletter.datascienceletter.com/p/practical-guides-on-catboost">Practical Guide to CatBoost</a></p></li><li><p>&#128736;&#65039; <a href="https://newsletter.datascienceletter.com/p/practical-guides-of-boosting-in-scikit">Practical Guide to scikit-learn</a></p></li><li><p>&#128269; <a href="https://newsletter.datascienceletter.com/p/lightgbm-vs-xgboost-vs-catboost">LightGBM vs XGBoost vs CatBoost: A Comparison</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qKwW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qKwW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 424w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 848w, 
https://substackcdn.com/image/fetch/$s_!qKwW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1272w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg" width="200" height="45.32967032967033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qKwW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 424w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 848w, 
https://substackcdn.com/image/fetch/$s_!qKwW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1272w, https://substackcdn.com/image/fetch/$s_!qKwW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3413a358-a654-494d-9871-a48f7b679f70_4672x1058.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NXkI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NXkI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 424w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 848w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1272w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png" width="202" height="77.73039215686275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:408,&quot;resizeWidth&quot;:202,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NXkI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 424w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 848w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1272w, https://substackcdn.com/image/fetch/$s_!NXkI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9109669a-5a1a-4aab-86b6-b47988153f0f_408x157.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Bbtx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bbtx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 424w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 848w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 1272w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bbtx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png" width="195" height="81.97674418604652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:235,&quot;width&quot;:559,&quot;resizeWidth&quot;:195,&quot;bytes&quot;:11867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bbtx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 424w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 848w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 1272w, https://substackcdn.com/image/fetch/$s_!Bbtx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a1dbae9-9157-4deb-9f83-7f3e08a62d17_559x235.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7kGq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7kGq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 424w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 848w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 1272w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7kGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png" width="156" height="83.98" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1200,&quot;resizeWidth&quot;:156,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;scikit-learn - Wikipedia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="scikit-learn - Wikipedia" title="scikit-learn - Wikipedia" 
srcset="https://substackcdn.com/image/fetch/$s_!7kGq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 424w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 848w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 1272w, https://substackcdn.com/image/fetch/$s_!7kGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94783bbc-c5a3-4886-9c41-b54f9b8b3576_1200x646.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h3>&#127920; Reinforcement Learning: Multi-Armed Bandits</h3><p>An introduction to <strong>basic and contextual multi-armed bandit (MAB)</strong> problems &#8212; ideal for understanding online decision-making under uncertainty. 
Includes algorithm comparisons and use cases in data science and ML.</p><ul><li><p>&#128216; <a href="https://newsletter.datascienceletter.com/p/concepts-of-multi-armed-bandits">Concepts in Multi-Armed Bandits</a></p></li></ul><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ksiT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ksiT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ksiT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png" width="1128" height="191" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ksiT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!ksiT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d74b69-84bb-41f6-93d6-2bf18606137e_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe 
to the paid version of the newsletter to enjoy more premium content!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Netflix measures the value of subscriber acquisition and retention using causal inference and Markov chain]]></title><description><![CDATA[Deep dive into how Netflix uses causal inference and Markov Chain models to evaluate the value of acquisition and retention]]></description><link>https://newsletter.datascienceletter.com/p/beyond-customer-lifetime-valuation</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/beyond-customer-lifetime-valuation</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Tue, 19 Mar 2024 11:17:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W0F-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W0F-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!W0F-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!W0F-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!W0F-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W0F-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:492140,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W0F-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!W0F-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!W0F-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!W0F-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28634fb8-481b-4ce6-976d-a1b7fd64ed12_8000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Quick summary</strong></h1><p>Hello &#128075; In this article, I will explain how Netflix measures the value of acquiring or retaining a subscriber, covering the following areas:</p><ul><li><p>Business scenario</p></li><li><p>Customer lifetime value</p></li><li><p>Markov Chain model</p></li><li><p>Incrementality calculation</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dtqF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!dtqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0726ced8-e8a2-40ac-a24e-508a19ccee84_1128x191.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><blockquote><p><em>&#128640;  Email us @ <a href="http://newsletter.verticalsolution.io/">social@verticalsolution.io</a> if you want us to 
deep dive into a data science topic!</em></p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.datascienceletter.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4C0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4C0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg" width="374" height="101.20604395604396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;File:Netflix 2015 logo.svg - Wikimedia Commons&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="File:Netflix 2015 logo.svg - Wikimedia Commons" title="File:Netflix 2015 logo.svg - Wikimedia Commons" srcset="https://substackcdn.com/image/fetch/$s_!4C0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 424w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 848w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1272w, https://substackcdn.com/image/fetch/$s_!4C0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db64902-387a-40a2-aa00-062c9c117541_1024x277.svg 1456w" 
sizes="100vw"></picture><div></div></div></a></figure></div><div><hr></div><h2>Business scenario</h2><p>The goal is to measure the long-term value of acquiring or retaining a Netflix subscriber. This matters for a subscription-based streaming platform like Netflix, because the business often wants to know the monetary benefit of marketing campaigns or product developments that aim to increase acquisition of new subscribers or retention of existing ones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Vqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Vqj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 424w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 848w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 1272w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6Vqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png" width="1176" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:1176,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Vqj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 424w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 848w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 1272w, https://substackcdn.com/image/fetch/$s_!6Vqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F258d2cde-c5e8-458c-9524-8a2cc242894b_1176x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Methodology</h2><h4><strong>Need for causal model</strong></h4><p>Subscription is a self-selected behaviour: we cannot control whether a user subscribes, so it is not possible to run an A/B test on subscription itself. Estimating the value of acquisition or retention therefore has to rely on causal inference techniques.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8FJl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!8FJl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 424w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 848w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 1272w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8FJl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png" width="1456" height="241" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33360c89-00da-4012-80c7-5859898a5435_1608x266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:241,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8FJl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 424w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 848w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 1272w, https://substackcdn.com/image/fetch/$s_!8FJl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33360c89-00da-4012-80c7-5859898a5435_1608x266.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4><strong>Customer lifetime value (CLV)</strong></h4><p>Naively using CLV as the value of acquisition or retention is likely an overestimate, because an off-service subscriber could resubscribe in the future. The CLV of these customers is therefore non-zero and has to be taken into account when calculating the value of acquisition or retention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7kCH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7kCH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 424w, 
https://substackcdn.com/image/fetch/$s_!7kCH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 848w, https://substackcdn.com/image/fetch/$s_!7kCH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 1272w, https://substackcdn.com/image/fetch/$s_!7kCH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7kCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png" width="1164" height="506" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1164,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7kCH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 424w, 
https://substackcdn.com/image/fetch/$s_!7kCH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 848w, https://substackcdn.com/image/fetch/$s_!7kCH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 1272w, https://substackcdn.com/image/fetch/$s_!7kCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F876d25b5-d5ed-40b9-ab42-b88c92ce1bd0_1164x506.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>The method</strong></h4><p>Netflix data scientists came up with a method that combines <strong>Markov chains and causal inference</strong> to solve the problem (for a quick introduction to Markov chains, see the Wikipedia page <a href="https://en.wikipedia.org/wiki/Markov_chain">here</a>).</p><p>A Markov chain is specified by a set of states and the transition probabilities between them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dwjU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dwjU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 424w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 848w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 1272w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!dwjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp" width="1456" height="878" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36126,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dwjU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 424w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 848w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 1272w, https://substackcdn.com/image/fetch/$s_!dwjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44e96962-2b98-4ab8-ba76-393ef3bfc1f5_1600x965.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>State definition</strong> &#8212; each subscriber (on- or off-service) is represented by a state <em>s</em>, which records the number of consecutive billing cycles the subscriber has been on or off the service. Each subscriber is in one of the states: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\{0, s_1, s_2, \\ldots, s_N, s_{-1}, \\ldots, s_{-M}\\}&quot;,&quot;id&quot;:&quot;TWDKTJMAPO&quot;}" data-component-name="LatexBlockToDOM"></div><p>where a positive index i means that the subscriber has been subscribed to the service for i consecutive billing cycles, while a negative index -i means that the subscriber has been unsubscribed for i consecutive billing cycles. 
N and M are assumed to be very large.</p><p><strong>Transition probabilities</strong> &#8212; the transition probabilities between states </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p(s_{i+1} | s_{i})&quot;,&quot;id&quot;:&quot;KLAMRBJVRZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>can be estimated from historical data, either as empirical frequencies or by fitting a model.</p><p><strong>Value of a subscriber &#8212; </strong>the value function (widely used in Markov chain problems and reinforcement learning) describes the expected cumulative discounted reward starting from a state s. In this case, given a state s, the value function can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V_s = \\sum_{k=1}^{\\infty} r^k~I(\\mathrm{subscribed~at~kth~cycle~given ~s})~c_k&quot;,&quot;id&quot;:&quot;PBPQWTMNUZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where r is the discount factor, I is an indicator function equal to 1 if the household is subscribed to the service at the k-th cycle given the current state s, and c_k is the price of the service at the k-th cycle. The value function can be obtained by running numerical simulations or by solving the Bellman equation.</p><p><strong>Calculating incremental values &#8212; </strong>After calculating the value function for all the states, we can calculate the incremental value in various acquisition and retention scenarios. 
The value of acquiring a new subscriber is </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V_{s_1} - V_{0}&quot;,&quot;id&quot;:&quot;JDACDYDKMJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Similarly, the value of re-acquiring a subscriber who has been off the service for k billing cycles is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V_{s_1} - V_{s_{-k}}&quot;,&quot;id&quot;:&quot;RYLXBAYOQV&quot;}" data-component-name="LatexBlockToDOM"></div><div><hr></div><h2>Summary</h2><p>This post summarises the essential concepts that Netflix data scientists used to estimate the value of customer acquisition and retention. For more detail, you can access the paper <a href="https://dl.acm.org/doi/abs/10.1145/3485447.3512058">here</a>.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!waJJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!waJJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!waJJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!waJJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!waJJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!waJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!waJJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!waJJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!waJJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 1272w, 
https://substackcdn.com/image/fetch/$s_!waJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aa5fab-0eaa-4dff-847e-9148c4354ffe_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.datascienceletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">&#128640; <em>Thank you for reading! Give us some feedback @ <a href="http://newsletter.verticalsolution.io/">social@verticalsolution.io</a> &#128640; </em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Concepts of boosting algorithms in machine learning]]></title><description><![CDATA[What are the differences between Adaboost, gradient boosting and Newton boosting?]]></description><link>https://newsletter.datascienceletter.com/p/boosting-algorithms-in-machine-learning</link><guid isPermaLink="false">https://newsletter.datascienceletter.com/p/boosting-algorithms-in-machine-learning</guid><dc:creator><![CDATA[Data Science Letter]]></dc:creator><pubDate>Tue, 19 Mar 2024 11:11:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ba19bef9-a9be-4e7d-ad40-c8f87e614f2d_1080x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!JQCC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JQCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JQCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png" width="1128" height="191" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JQCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 424w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!JQCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6573a9e0-3fbb-463a-bb7d-39e8fcdde072_1128x191.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2><strong>What is boosting?</strong></h2><p>In machine learning, boosting is an ensemble technique that combines a set of weak learners into a stronger learner with lower bias and variance. 
A weak (strong) learner is a predictor, for example a classifier or a regressor, whose predictions are weakly (strongly) correlated with the target label.</p><p>Although boosting does not prescribe a specific form of algorithm, most boosting algorithms learn the weak predictors iteratively and combine them at the end into one strong predictor.</p><p>In mathematical form, this can be expressed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_n(x) = \\sum_{k=1}^{n} w_k f_{k}(x)&quot;,&quot;id&quot;:&quot;WHXPHUDOMC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_n(x)&quot;,&quot;id&quot;:&quot;PBHSSNJUDF&quot;}" data-component-name="LatexBlockToDOM"></div><p> is the final predictor after n iterations, </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f_k(x)&quot;,&quot;id&quot;:&quot;DFQMGXACKE&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the weak predictor fitted at the k-th iteration, and </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_k&quot;,&quot;id&quot;:&quot;HBPWWHFGVO&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the weight associated with the k-th weak predictor, which is calculated differently depending on the boosting algorithm used.</p><p>In the following sections, we explain how the weak predictors and weights are determined for each boosting algorithm.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!38lG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!38lG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 424w, https://substackcdn.com/image/fetch/$s_!38lG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 848w, https://substackcdn.com/image/fetch/$s_!38lG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 1272w, https://substackcdn.com/image/fetch/$s_!38lG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!38lG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ensemble_boosting_wikipedia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ensemble_boosting_wikipedia" title="ensemble_boosting_wikipedia" 
srcset="https://substackcdn.com/image/fetch/$s_!38lG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 424w, https://substackcdn.com/image/fetch/$s_!38lG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 848w, https://substackcdn.com/image/fetch/$s_!38lG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 1272w, https://substackcdn.com/image/fetch/$s_!38lG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc609b96e-6f53-40c9-a547-b13b3c2d04e0_960x540.svg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Illustration of boosting algorithms from Wikipedia</figcaption></figure></div><div><hr></div><h2><strong>AdaBoost</strong></h2><p>Among the many boosting algorithms that have been developed, one of the earliest and most significant is <strong>the AdaBoost algorithm</strong>. </p><h4>Learning weak predictors</h4><p>The core idea of AdaBoost is that, at each iteration, a weak predictor <strong>is fitted to a weighted dataset</strong> instead of the original dataset. The weight of each data point at a given iteration is determined by its residual: the larger the residual, the larger the weight. This lets the weak predictors at later iterations focus on correcting the prediction errors on the more difficult samples.</p><h4>Shrinkage</h4><p>As the number of iterations increases, more and more emphasis is placed on difficult samples, which makes the algorithm more susceptible to outliers.</p><p>One way to stabilise the algorithm against this is to use a learning rate, also known as a shrinkage rate. 
The idea is that the contribution of the weak learners added in later iterations is shrunk by this rate, which limits the influence that later-fitted learners have on the overall prediction.</p><p>Mathematically, this can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_n(x) = f_0(x) + w \\sum_{k=1}^{n} f_{k}(x)&quot;,&quot;id&quot;:&quot;ZPMNQFMRHN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where n is the number of boosting iterations, f_0 is the initial predictor and w is the shrinkage rate.</p><div><hr></div><h2><strong>Gradient boosting</strong></h2><p>Gradient boosting is a kind of boosting algorithm that, instead of fitting a weak learner to the residuals from previous iterations, fits it to the <strong>negative gradients</strong> of the loss function evaluated at the current prediction. (Yes, even if you are training a classification model, at every iteration except the very first one, the weak predictor is actually a regression model predicting negative gradients.)</p><p>In mathematical form, the update rule for the final predictor is </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_k(x) = F_{k-1}(x) + \\eta_k f_k(x)&quot;,&quot;id&quot;:&quot;MOGSTTNPDY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where, similar to above,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_{k}(x)&quot;,&quot;id&quot;:&quot;SGDIPJVZBS&quot;}" data-component-name="LatexBlockToDOM"></div><p> is the final predictor (not the weak predictor!) 
at the k-th iteration, </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f_k(x)&quot;,&quot;id&quot;:&quot;CZEYKXWCBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the weak predictor at the k-th iteration, and </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\eta_k&quot;,&quot;id&quot;:&quot;CWGFKJETEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the shrinkage rate at the k-th iteration.</p><p>When the weak learner f is trained on negative gradients as the target label, the update rule becomes </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_k(x) \\approx F_{k-1}(x) - \\eta_k \\left.\\frac{\\partial L}{\\partial F} \\right|_{F_{k-1}(x)}&quot;,&quot;id&quot;:&quot;KRHYIPNBDY&quot;}" data-component-name="LatexBlockToDOM"></div><p>This formula looks exactly like an update step in gradient descent optimisation. In fact, it can be shown that this update is mathematically equivalent to gradient descent in function space, hence the name gradient boosting.</p><p>This approach generalises to various objective functions. Popular open-source packages such as scikit-learn use gradient boosting in various forms.</p><h2><strong>Newton boosting</strong></h2><p>Similar to gradient boosting, at each boosting iteration, Newton boosting fits a weak predictor to a modified target label. Unlike gradient boosting, which uses only negative gradients as the target label, it uses both the gradients and the second derivatives (Hessians) of the loss to perform the update in each iteration:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_k(x) = F_{k-1}(x) - \\eta_k  \\frac{\\left. \\frac{\\partial L}{\\partial F}\\right|_{F_{k-1}(x)}}{\\left. 
\\frac{\\partial^2 L}{\\partial F^2}\\right|_{F_{k-1}(x)}}&quot;,&quot;id&quot;:&quot;YDVMOIODVY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Again, like gradient boosting, this update resembles Newton&#8217;s method and corresponds to a Newton step in function space.</p><h2><strong>Gradient tree boosting</strong></h2><p>Many of the most popular open-source tree-based gradient boosting algorithms nowadays actually implement gradient boosting a bit differently.</p><p>As in ordinary gradient boosting, a tree is grown and added to the final predictor at each boosting iteration.</p><p>However, when growing a tree at each boosting round, the optimal split in each node is determined by maximising the loss reduction calculated by this formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta L_{\\mathrm{split}} = L_{\\mathrm{no~split}} - \\left(L_{\\mathrm{left}} + L_{\\mathrm{right}}\\right)&quot;,&quot;id&quot;:&quot;LMMTQSYSNI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{k} = -\\frac{1}{2} \\sum_{i=1}^{N_{\\mathrm{node}}} \\frac{\\left(\\sum_{j\\in\\mathrm{node}~i} g_j\\right)^2}{n_i}&quot;,&quot;id&quot;:&quot;PLASVPSYTC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the outer sum runs over the nodes being evaluated, g is the gradient for data point j, and n is the number of data points in node i. This criterion uses only first-order (gradient) information about the loss function: it results from fitting each node value to the negative gradients by least squares. </p><h2><strong>Newton tree boosting</strong></h2><p>By analogy with gradient tree boosting, one can define a more advanced boosting algorithm that also takes into account the Hessian of the loss function.</p><p>In particular, when growing a tree at each boosting round, we instead use a second-order Taylor approximation to the loss function. 
All that is needed is a different loss function for node splitting.</p><p>In other words, as in gradient tree boosting, the optimal split at each node is the one giving the maximum loss reduction, but the loss is now</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{k} = -\\frac{1}{2} \\sum_{i=1}^{N_{\\mathrm{node}}} \\frac{\\left(\\sum_{j\\in\\mathrm{node}~i} g_j\\right)^2}{\\sum_{j\\in\\mathrm{node}~i} h_j+\\lambda} + \\gamma N_{\\mathrm{node}}&quot;,&quot;id&quot;:&quot;JEGALGBZIR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where g and h are the gradient and Hessian for data point j, &#955; is a regularisation parameter that penalises large leaf values, and &#947; is a regularisation parameter that penalises the number of nodes. Without regularisation, this reduces to</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{k} = -\\frac{1}{2} \\sum_{i=1}^{N_{\\mathrm{node}}} \\frac{\\left(\\sum_{j\\in\\mathrm{node}~i} g_j\\right)^2}{\\sum_{j\\in\\mathrm{node}~i} h_j}&quot;,&quot;id&quot;:&quot;BKDWGCJSXJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Popular open-source packages such as LightGBM and XGBoost use Newton boosting, with added regularisation, as their main boosting method.</p><h2>Summary</h2><p>Understanding the fundamental concepts of boosting algorithms presented in this article is crucial for grasping the design choices behind advanced boosting algorithms like LightGBM and XGBoost.</p><p>These advanced algorithms build upon the core principles of boosting explained here, and add optimisations for efficiency, scalability, and handling complex data structures. For instance, LightGBM and XGBoost use decision trees as weak learners and implement efficient algorithms for finding optimal splits within those trees. 
Their advancements lie in techniques like gradient-based decision tree learning, gradient-based sampling, and regularization to prevent overfitting.</p><p>By understanding the fundamental building blocks of boosting explored in this article, you can gain a deeper appreciation for the design philosophies behind cutting-edge boosting algorithms and how they achieve superior performance!</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tiuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, 
https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png" width="1128" height="191" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:191,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tiuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 424w, 
https://substackcdn.com/image/fetch/$s_!tiuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 848w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1272w, https://substackcdn.com/image/fetch/$s_!tiuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10ac92f-cd9f-428b-ae78-5cb1e885d925_1128x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>]]></content:encoded></item></channel></rss>