10 Minute Read

Published Dec 2022

While everyone focuses on the sexy part of machine learning, the models, the data they learn from is just as important.

These days you can create a working regression model, random forest, or neural net in fewer than 10 lines of code. There's only so far you can go by optimizing the model architecture for speed and accuracy; the rest is up to the information your model learns from. The famous 'garbage in, garbage out' adage applies here.
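To make that concrete, here's a sketch of how little code a working model takes with scikit-learn, using one of its bundled toy datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small bundled dataset and fit a random forest in a handful of lines.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```

The model itself is almost an afterthought; everything interesting happens in the data you feed it.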

NLP Data Prep

Interestingly, there's been research into creating a better-performing GPT-3. Researchers trained a 530-billion-parameter model called MT-NLG, compared to GPT-3's 175 billion parameters, and found no significant improvement over GPT-3. Then came a model called Chinchilla, with a relatively small 70 billion parameters. The difference is in the abundance of training data: by pre-training on far more of it, Chinchilla was able to surpass the performance of GPT-3.

A thread for the interested: https://twitter.com/cwolferesearch/status/1604969213189246976

Now you understand the importance of data. You may notice that the thread linked above says 'training tokens' instead of training data. This is due to the concept of tokenization. Machine learning models are made up of a complex, interconnected web of numbers and equations; there's no way for them to know what a concept like 'complexity' or 'humor' is.

Tokenization is the process of converting text into tokens a model can work with. Our brains do something similar, converting sound waves into electrical signals through the ears.

Read more on tokenization: https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/

There are different tokenization methods such as word tokenization, character tokenization, and subword tokenization (and different algorithms for each). But this single data-preparation step for natural language processing pales in comparison to the preparation required for most tabular datasets.
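As a quick illustration, here's a minimal sketch of word-level and character-level tokenization in plain Python (subword schemes such as BPE are normally handled by a library, so they only get a comment here):

```python
text = "Garbage in, garbage out."

# Word tokenization: split on whitespace (real tokenizers also handle punctuation).
word_tokens = text.lower().split()
print(word_tokens)       # ['garbage', 'in,', 'garbage', 'out.']

# Character tokenization: every character becomes its own token.
char_tokens = list(text)
print(char_tokens[:10])  # ['G', 'a', 'r', 'b', 'a', 'g', 'e', ' ', 'i', 'n']

# Subword tokenization (e.g. BPE) sits between the two and is usually done with a library.

# Models consume numbers, so tokens are mapped to integer ids via a vocabulary.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]
print(token_ids)         # [0, 1, 0, 2]
```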

Tabular data is data in the form of a table. Usually the columns represent the different features while the rows represent different datapoints. Today I'll be covering everything from data analysis to feature selection.
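For instance, a tiny toy table in pandas (the column names echo the stroke dataset we'll use below, but the values here are made up):

```python
import pandas as pd

# Each column is a feature, each row is a single datapoint (here: one patient).
df = pd.DataFrame({
    "age": [67, 45, 80],
    "bmi": [36.6, 28.1, 32.5],
    "smoking_status": ["formerly smoked", "never smoked", "smokes"],
    "stroke": [1, 0, 1],   # the target we want to predict
})
print(df.shape)   # (3, 4) -> 3 datapoints, 4 columns
print(df.dtypes)  # numeric vs categorical features
```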


Training a model is a highly iterative cycle: the entire process, from data analysis through feature selection, is usually repeated multiple times, trying different things each time. You want to start with the simplest version of each step and see how you do, then iterate by improving both your data quality and your model choices. An entire course could be created on each of these steps, so this blog is more of an overview.

Let's bring in a real-world example: the stroke prediction dataset on Kaggle, with 11 features.

Stroke Prediction Dataset
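A first look at the data might go something like this; it's a sketch, and the CSV filename is an assumption based on how the file is commonly named when downloaded from Kaggle, so adjust the path to your local copy:

```python
import pandas as pd

# Load the stroke prediction dataset downloaded from Kaggle.
# (Filename is an assumption -- point this at wherever you saved the file.)
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

print(df.shape)          # number of rows (patients) and columns (features + target)
print(df.head())         # peek at the first few datapoints
print(df.isna().sum())   # count missing values per column, a first data-quality check
df.info()                # dtypes show which features are numeric vs categorical
```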