Optimizing Resource Utilization of Tree-Based Classification Algorithms in Apache Spark

Tree-based machine learning algorithms like Decision Trees and Random Forests are among the most popular and effective algorithms for classification problems. In deployment, however, running them on large volumes of data is often resource intensive, so optimizing resource usage is a priority for an ML-based product such as Tellius. In this blog, I will walk through an example of high memory utilization, how we tackled this problem in Apache Spark, and how we shared our innovation with the open source community.

Tree Algorithms in Tellius

Tellius uses tree-based algorithms heavily across different parts of the platform. One of the key areas where they are used is Segment Insights.

As part of the Tellius automated insights capability, Segment Insights allows a user to discover the key factors driving a given behavior or a metric across all the business data. For example, a retailer may want to know “what’s driving orders over $500” and then come to learn that certain slices of customers, products, regions, or salespeople are most correlated to large orders. They can use that knowledge to encourage high-value transactions through promotions and other activities.

For illustrative purposes, I will refer to a common machine learning data set of census data available from the University of California, Irvine. The task is to analyze the data to determine which factors most strongly correlate with people who make over $50k a year, and in turn build a model that can best make such a prediction.

In Tellius we would set up the question as shown below.

Internally, this question is converted into a classification exercise, and machine learning algorithms are applied. In this analysis, tree algorithms are among the key components.

In addition to Insights, Tellius also uses these algorithms in its AutoML and point-and-click custom machine learning components.

RAM Usage of Tree Algorithms

Tellius uses the Apache Spark engine for data and analytics processing. Tree algorithms in Spark generate quite a bit of intermediate data during the analysis. By default, this intermediate data is stored in RAM, and when processing large data sets it can cause memory usage to shoot up and lead to out-of-memory issues.

Running Spark Decision Tree

The screenshot below shows how resources are utilized when running the decision tree algorithm on the census data. The initial size of the data set is 1 GB.

In the screenshot you can see two things:

    • FileScan csv – This is the cached raw data. Even though the original size on disk is 1 GB, it occupies only 250 MB in memory, thanks to compression.
    • MappedRDD – This is the intermediate data written by the Decision Tree algorithm. It takes over 4 GB, four times the size of the raw data. Of this 4 GB, 3.4 GB sits in main memory, a huge increase in memory usage. Because the algorithm uses the MEMORY_AND_DISK storage level by default, it fills as much memory as possible and spills over to disk only when the data cannot fit. In our case, it occupies about another 1 GB on disk.

This is good for deployments where plenty of RAM is available. But what about disk-based deployments?
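For RDDs that application code creates itself, Spark already lets the caller pick a storage level through the standard `persist` API. The sketch below (file paths and data set are illustrative) shows the two behaviors discussed above: the compressed columnar cache for raw data, and an explicit storage level choice for an RDD.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("storage-levels").getOrCreate()

// Cached raw data: Spark's columnar cache compresses it, which is why
// 1 GB of CSV on disk can shrink to roughly 250 MB in memory.
val census = spark.read.option("header", "true").csv("census.csv")
census.persist(StorageLevel.MEMORY_AND_DISK)

// For RDDs we build ourselves, the storage level is an explicit choice:
val raw = spark.sparkContext.textFile("census.csv")
raw.persist(StorageLevel.DISK_ONLY) // keep RAM free for interactive queries
```

The gap described in this post is that the tree algorithms create and cache their intermediate RDDs internally, so this choice is not exposed to the caller.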

The Need for Disk-Based Deployments

In Tellius, we believe that every organization should be able to leverage the power of ML irrespective of data size or resource constraints. So we want Tellius to run even in environments where RAM is scarce. This situation is very common, for example, in Hadoop clusters, which typically have far more disk than RAM.

For a highly interactive system like Tellius, RAM is a precious resource. We want it used for tasks such as search-driven conversational analytics and interactive queries, where users expect an instant response regardless of the size of the data they are exploring, rather than for machine learning jobs that run in the background. Smart utilization of RAM improves the overall responsiveness and user experience of the platform.

Whenever Spark sees there is not enough RAM to cache this intermediate data, it starts evicting other datasets from the cache, which makes interactive querying slow for those datasets. This is not desirable: users can wait a little longer for ML jobs, but not for interactive searches.

Lack of User Control for Intermediate Data

So it would be beneficial to choose how intermediate data is stored depending on the type of analysis being performed. The problem is that, by default, Spark does not expose any mechanism for controlling this storage. This makes it challenging to run these popular algorithms on large data in disk-based deployments.

Configurable Storage for Intermediate Data

To solve this problem in Tellius, we modified the Spark source code to make the storage of intermediate data flexible. Below is the extra API we added to the algorithms.


The above API allows us to set the storage level as needed, giving us control over how the intermediate data is stored. This option is available for all the tree algorithms.
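As a rough sketch of how such an option can be used from caller code (the setter name below is hypothetical and shown only for illustration; see the PR linked later in this post for the actual change):

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier()
  .setLabelCol("income")
  .setFeaturesCol("features")

// Hypothetical setter illustrating the idea: direct the algorithm's
// intermediate RDDs to disk instead of RAM. The real method name and
// signature are in the patch itself.
dt.setIntermediateStorageLevel("DISK_ONLY")
```

The design follows the pattern Spark already uses elsewhere, where an algorithm accepts a storage-level parameter instead of hard-coding one internally.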

Below is the screenshot after we enable this option.

As the screenshot shows, all the intermediate data is now cached on disk.

Contribution to Open Source

At Tellius we believe in contributing to open source, so we have raised a PR against upstream Spark to let the community benefit from our contribution.

Here is the link to the PR on GitHub: https://github.com/apache/spark/pull/17972


Tree algorithms are among the best-known classification algorithms. Tellius has made them flexible enough to run on any kind of deployment, making them available to a wide variety of customers.

