Felipe Veloso

Jan 3

Real cases of Machine Learning at a Big Scale

It is no surprise that the technology industry is looking to build more automated solutions, supported by Machine Learning, that help make decisions (recommendations, projections, estimates and smart decision makers). Building these solutions involves a great deal of work before and after the model itself: acquiring the data, processing it, storing it, training models, deploying and monitoring them, and retraining them, just to name a few steps.

As I mentioned in a previous post, I work at a logistics intelligence company called SimpliRoute (www.simpliroute.com), where one of my focus areas is the end-to-end Machine Learning pipeline (data gathering, cleaning, training, deployment and retraining).

The problem we are trying to solve with Machine Learning is improving the input required by the Vehicle Routing Problem (VRP) algorithm, in our case a Rich VRP. A key input is the travel time between points, which is essential for good route planning. In simpler words:

We predict the travel time of a vehicle on city streets using GPS signals that are projected onto a graph of the city, so that the preset travel times are closer to what the driver will really encounter on the street.
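To make the idea concrete, here is a minimal sketch of how raw GPS pings could be turned into (distance, travel time) observations. It is not our production code: the `Ping` fields and the helper names are hypothetical, and it uses the straight-line haversine distance instead of projecting pings onto a street graph.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    """One GPS reading. Field names are hypothetical."""
    vehicle_id: str
    timestamp_s: float  # seconds since epoch
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def segment_observations(pings):
    """Return (distance_m, travel_time_s) pairs for consecutive pings of one vehicle."""
    ordered = sorted(pings, key=lambda p: p.timestamp_s)
    observations = []
    for prev, cur in zip(ordered, ordered[1:]):
        dt = cur.timestamp_s - prev.timestamp_s
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        observations.append((haversine_m(prev.lat, prev.lon, cur.lat, cur.lon), dt))
    return observations
```

Each pair is one training observation; in a real setting the pings would first be matched to street segments so that times are aggregated per edge of the city graph.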

This is based primarily on the historical GPS readings stored in our data warehouse, which is one of the reasons we used Airbyte in some of our data gathering processes.

First Machine Learning solutions and Big Data in Real Settings

Training a Machine Learning model on a historical base of more than 7,000 million GPS pings (with a daily intake above 15M pings) and a volume above 1 TB is not a simple problem: statistics have to be computed and the data analyzed and cleaned, all on a continuous GPS stream (via Pub/Sub, Dataflow and BigQuery) stored in BigQuery.

Getting there took a series of steps that I'll try to summarize. The data analysis phase and the features to be ingested into BigQuery were first worked out on smaller segments (a sample of the large number of GPS pings, taken under ideal conditions: no rain, no protests, no major variations, and just one sector). Based on this sample, the first Machine Learning models were built and initial metrics were obtained.

The following models were built with the available data:

  • TensorFlow
  • Random Forest
  • XGBoost
  • Linear Regression (baseline)
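The role of the linear regression baseline can be illustrated with a small sketch: any more complex model should beat its error, and the baseline itself should beat a naive "always predict the mean" rule. The data below is made up for illustration and the single feature (segment length) is an assumption, not the feature set we actually used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mae(y_true, y_pred):
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up data: segment length in meters -> travel time in seconds.
train_x = [100.0, 250.0, 400.0, 800.0, 1200.0]
train_y = [14.0, 31.0, 52.0, 95.0, 150.0]
test_x = [300.0, 1000.0]
test_y = [37.0, 123.0]

slope, intercept = fit_line(train_x, train_y)
baseline_pred = [slope * x + intercept for x in test_x]
naive_pred = [sum(train_y) / len(train_y)] * len(test_x)

baseline_mae = mae(test_y, baseline_pred)  # error of the linear baseline
naive_mae = mae(test_y, naive_pred)        # error of always predicting the mean
```

Logging `baseline_mae` next to the metrics of the heavier models is what makes it possible to judge whether their extra cost buys any accuracy.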

It's worth mentioning that, to track the metrics and results of these models, we added MLflow, which you can read about in my post on supporting MLOps growth.

The paper that initiated this solution will be published soon:

Travel time estimation and prediction using GPS data developed by: Javiera Morales Benza, Cristian Cortés, Victor Gonzalez, and Alvaro Echeverria.

And the Big Data?

The challenge isn't easy. Working with 3 TB of RAM in distributed systems on a controlled budget requires expertise and care (figure calculated from Dataproc with MLlib), especially when your models have only processed samples and the sample dataset does not capture the time series generated by the GPS intake.

Decisions for the next phase were made through analytics, data cleaning pipelines and a lot of BigQuery analysis (trial and error plays a key role in the cost here). Around 2.5 petabytes were processed before we found a solution that let us build a pipeline (5 stages orchestrated with BigQuery and Google Cloud Composer), including the generation of geographical polygons from GPS latitude and longitude and a partial outlier analysis of our dataset.
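As an illustration of what an outlier pass over GPS-derived data can look like, here is a minimal sketch that drops segment speeds outside the interquartile-range fences. This is an assumption for the example: the post does not say which outlier rule our pipeline uses, and the function names and sample speeds are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_speed_outliers(speeds_kmh, k=1.5):
    """Keep only speeds inside the IQR fences (and never below zero)."""
    low, high = iqr_bounds(speeds_kmh, k)
    low = max(low, 0.0)
    return [s for s in speeds_kmh if low <= s <= high]


# Made-up segment speeds; 410 km/h is the kind of value a GPS jump produces.
speeds = [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 38.0, 40.0, 410.0]
cleaned = drop_speed_outliers(speeds)
```

At warehouse scale the same rule would be expressed as a query over approximate quantiles rather than in memory, but the logic is the same.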

After orchestrating the creation of our definitive dataset, the work of training on a dataset of about 4,000 million records (around 1 TB) begins.

We tried several approaches, which I will comment on:

  • BigQuery ML
  • Vertex AI
  • Kubeflow
  • Spark + Koalas + MLlib + PySpark
  • TensorFlow
  • XGBoost

What we've learned

Having built several models and tried several ways of working with models on Big Data, it is time to make what we learned tangible.


BigQuery ML

As a good practice, we opted for the AutoML route of BigQuery ML, because it made it easy to generate a baseline for the metrics, followed by some hyperparameter tuning and an attempt to deploy to production environments (the stage we are currently in). After the road traveled, it is refreshing to obtain predictions based on our dataset. In less than 30 minutes we generated 2 models with different parameters and got predictions in an almost miraculous way.
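To give an idea of how little is needed for a BigQuery ML baseline, here is a minimal sketch that renders a `CREATE MODEL` statement as a string. The model, table and label names are hypothetical, the helper is not part of any library, and the statement is only built here, not executed; it would have to be submitted with the BigQuery client or console of your choice.

```python
# BigQuery ML regression model types used in this sketch.
SUPPORTED_TYPES = {"LINEAR_REG", "BOOSTED_TREE_REGRESSOR", "AUTOML_REGRESSOR"}


def create_model_sql(model, source_table, label, model_type="LINEAR_REG"):
    """Render a BigQuery ML CREATE MODEL statement (hypothetical helper)."""
    if model_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported model_type: {model_type}")
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# Hypothetical dataset, table and column names.
sql = create_model_sql(
    model="my_dataset.travel_time_baseline",
    source_table="my_dataset.training_segments",
    label="travel_time_s",
    model_type="AUTOML_REGRESSOR",
)
```

Switching `model_type` is all it takes to go from a linear baseline to boosted trees or AutoML on the same table, which is what makes this route fast for a first iteration.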

With this, the wait for tangible results from ML models, and with it the pressure any organization puts on them, is shortened significantly. The models aren't perfect and require a lot of iteration, but that is exactly why a quicker result helps decide whether the problem we're trying to solve is worth it.


Vertex

Our tests with Vertex weren't really satisfactory. Running this type of AutoML didn't really help, despite having a fully developed dataset. We trained for more than 10 hours multiple times, on several datasets of less than 100 GB each (significantly smaller than our real dataset).


Kubeflow

Kubeflow is one of my favorite tools, along with MLflow. Kubeflow and its variant Kubeflow Pipelines on GCP allowed us to explore new distributed solutions. The problem is that it takes knowledge and dedication to adapt models that are already built (except for TensorFlow ones) and, since my role is more oriented to Data Engineering than to Machine Learning Engineering, my dedication to Machine Learning pipelines is not full time. The effort needed to adapt and implement Kubeflow is not a realistic option for growing startups, nor a trade-off that is worth it in the first deliveries of a Machine Learning model.

Spark Ecosystem

Dataproc (managed Spark clusters on GCP) was a solution we explored widely, not just for training but also for statistical analyses, which made our BQML work simpler. Our results here were good, but having no prior plan for where, how and when to train, the adaptation of libraries to PySpark (plus its learning curve) and the issues typical of distributed systems left us with a bitter taste when working in this ecosystem.

Now, thanks to the support of management and the CTO, I have been able to adopt a bit more of the Databricks ecosystem, which includes Spark-based solutions just like Dataproc but with a huge ecosystem for Data Warehousing, Machine Learning and MLOps. It does, however, involve high infrastructure costs, and even higher ones for training.

Notebooks — Kubernetes — Clusters

The last and most traditional training option was what we already knew. Putting 1 TB of data in memory on a machine is really expensive and difficult without a prior approach to the problem.

We created a high-performance notebook that yielded results only after more than 72 hours, which discouraged us tremendously (we learned that the process was not optimized, even though it was only training, since our data pipeline already handles everything regarding data quality and management).

Kubernetes with HPA or Managed Instances left us feeling much the same: too many resources, partial solutions and far too many experiments in MLflow (the price of this tool). I don't want to dwell on this, but we didn't manage to train on the full dataset.


Machine Learning training on Big Data consumes a lot of resources, from data analysis all the way to a production model. This post did not cover MLOps (ideally with Kubeflow, MLflow or Kedro), nor artifact generation, data lifecycle, monitoring, etc. What I want to say is that there is no fixed recipe for how to direct or manage ML projects. Many companies give up on becoming Data Driven / AI Ready / Predictors. What we learned comes down to 3 things that I'd like to summarize.

  • All the roles are important: Data Scientists, Data Engineers, AI/ML Engineers, MLOps and Software Engineers. Everyone has a part to play, and we often forget that this discipline involves so much knowledge that it's hard to find someone who knows and understands every ecosystem, tool and programming language. (If you do, be careful.)
  • To tackle ML problems, we must start simple: take a dirty dataset to BigQuery ML (or any AutoML tool) to generate a baseline, take it to production, monitor it and, after some weeks, test whether the model solves the problem (and whether it can be improved). Or maybe complex heuristics are the solution to the problem (and maybe you should have tried them before turning to Machine Learning).
  • Pick a path or philosophy. Sadly, there are many ways to tackle Big Data and Machine Learning problems, depending on author, cloud provider, ML provider, studies, self-learning, video tutorials or previous experience. There is no perfect, repeatable recipe. We're still in the initial stages and we're learning as we go. Choose a path, iterate, take that model to production, then assess whether it can be improved.