Tree-based machine learning algorithms like Decision Trees and Random Forests are among the most popular and effective algorithms for classification problems. In deployment, however, running them on large volumes of data is often resource intensive, so optimizing resource usage is a priority for an ML-based product such as Tellius. In this blog, I will illustrate an example of high memory utilization, how we tackled this problem in Apache Spark, and how we shared our innovation with the open source community.
Tree Algorithms in Tellius
Tellius uses tree-based algorithms heavily across different parts of the platform. One of the key areas where they are used is Segment Insights.
As part of the Tellius automated insights capability, Segment Insights allows a user to discover the key factors driving a given behavior or a metric across all the business data. For example, a retailer may want to know “what’s driving orders over $500” and then come to learn that certain slices of customers, products, regions, or salespeople are most correlated to large orders. They can use that knowledge to encourage high-value transactions through promotions and other activities.
For illustrative purposes, I will refer to a common machine learning data set of census data available from the University of California, Irvine. The task is to analyze the data to determine which factors most correlate with people who make over $50k a year and, in turn, to create a model that can best make such a prediction.
In Tellius we would set up the question as shown below.
Internally, this question is converted into a classification exercise, and machine learning algorithms are applied. Tree algorithms are among the key components of this analysis.
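To make the classification exercise concrete, here is a minimal sketch of how such a task might be set up with the Spark ML API. The file path and column names (`income`, `age`, `hours_per_week`, `education_num`) are assumptions for illustration, not the actual Tellius pipeline:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CensusIncome").getOrCreate()

// Raw census data; the "income" column holds ">50K" / "<=50K" labels.
val census = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/adult.csv")

// Convert the string label into a numeric label column.
val labelIndexer = new StringIndexer()
  .setInputCol("income")
  .setOutputCol("label")

// Combine a few numeric columns (assumed present) into a feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week", "education_num"))
  .setOutputCol("features")

val tree = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Fit the whole pipeline; the resulting model can score new records.
val model = new Pipeline()
  .setStages(Array(labelIndexer, assembler, tree))
  .fit(census)
```

A real analysis would also index the categorical columns and hold out a test split; the sketch only shows the shape of the classification exercise.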
In addition to Insights, Tellius also uses such algorithms in AutoML and Point-and-Click custom machine learning components.
RAM Usage of Tree Algorithms
Tellius uses the Apache Spark engine for data and analytics processing. Tree algorithms in Spark generate quite a bit of intermediate data during the analysis. By default, this intermediate data is stored in RAM, and when processing large data sets it can cause memory usage to shoot up and lead to out-of-memory issues.
Running Spark Decision Tree
The screenshot below shows how resources are utilized when running the decision tree algorithm on the census data. The initial size of the data set is 1 GB.
In the screenshot you can see two things:
- FileScan csv – This is the cached raw data. Even though the original size on disk is 1 GB, it occupies only 250 MB in memory. This is because of compression.
- MappedRDD – This is the intermediate data written by the Decision Tree algorithm. It takes over 4 GB, four times the size of the raw data. Of this 4 GB, 3.4 GB sits in main memory, a huge increase in memory usage. Because it uses the MEMORY_AND_DISK storage level by default, it uses as much memory as possible; whatever does not fit spills over to disk. In our case, that spill occupies about another 1 GB on disk.
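The MEMORY_AND_DISK behavior described above can be reproduced with an explicit persist call. A short sketch, assuming an existing `SparkSession` named `spark` and an illustrative file path:

```scala
import org.apache.spark.storage.StorageLevel

val census = spark.read.csv("data/adult.csv")

// MEMORY_AND_DISK keeps as many partitions in RAM as will fit and spills
// the remainder to disk -- the behavior seen in the screenshot, where
// 3.4 GB landed in memory and roughly 1 GB spilled to disk.
census.persist(StorageLevel.MEMORY_AND_DISK)

// DISK_ONLY, by contrast, keeps RAM free at the cost of slower access:
// census.persist(StorageLevel.DISK_ONLY)
```

The catch, discussed below, is that this choice is only available for data you persist yourself; the tree algorithms persist their intermediate data internally.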
This is good for deployments where plenty of RAM is available. But what about disk-based deployments?
The Need for Disk-Based Deployments
In Tellius, we believe that every organization should be able to leverage the power of ML irrespective of data size or resource constraints. So we want Tellius to run even in places where RAM is scarce. This situation is very common, for example, in Hadoop clusters, where disk is plentiful compared to RAM.
For a highly interactive system like Tellius, RAM is a precious resource. We prefer it to be used for tasks such as search-driven conversational analytics and interactive querying, where users expect instant responses regardless of the size of the data they are exploring, rather than for machine learning jobs that run analyses in the background. Smart utilization of RAM improves the overall responsiveness and user experience of the platform.
Whenever Spark sees there is not enough RAM to cache this intermediate data, it starts evicting other datasets from the cache, which makes interactive querying slow for those datasets. This is not desirable: users can wait a little longer for ML jobs, but not for interactive searches.
Lack of User Control for Intermediate Data
So it would be beneficial to choose how intermediate data is stored depending on the type of analysis being performed. The problem is that, by default, Spark does not expose any mechanism for controlling this storage. This makes it challenging to run these popular algorithms on large data in disk-based deployments.
Configurable Storage for Intermediate Data
To solve this problem in Tellius, we modified the Spark source code to make the storage of intermediate data flexible. Below is the extra API we added to the algorithms.
The above API allows us to set the desired storage level, giving us control over how the intermediate data is stored. This option is available for all the tree algorithms.
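For illustration, usage looks roughly like the following. The setter name `setIntermediateStorageLevel` is how we refer to it here; the exact method name and signature are in the upstream pull request linked at the end of this post:

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Hypothetical sketch: the added parameter lets the caller choose where
// the algorithm's intermediate data is cached, instead of the hardcoded
// MEMORY_AND_DISK default.
val tree = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setIntermediateStorageLevel("DISK_ONLY") // keep RAM free for interactive work
```

With DISK_ONLY, a long-running background ML job no longer competes with interactive queries for cache space.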
Below is the screenshot after we enable this option.
As the screenshot shows, all the intermediate data is now cached on disk.
Contribution to Open Source
At Tellius we believe in contributing to open source, so we have raised a PR against upstream Spark so the community can benefit from our work.
Here is the link to the PR on GitHub: https://github.com/apache/spark/pull/17972
Tree algorithms are among the best-known classification algorithms. Tellius has made them flexible enough to run on any kind of deployment, making them available to a wide variety of customers.