Storing data for machine learning in MySQL

I am developing a machine learning application in which users can upload their own data to train models. Users must also be able to extend their uploaded data later by appending new data to it. I am unsure how to actually store this data, and whether it should be in a database at all.

The data source can be:

  • CSV file
  • Excel file

Because each file is unique, the column names are dynamic. For example:

// File_1.csv
comment
----------
Some text here
another text here

// File_1.xlsx
ingredient
----------
tomato
strawberry

At some point, after the data has been uploaded, the user should be able to extend it (append new data to it). For example:

// File_1 data:
text added at a later date

Labels

Now, after the data has been uploaded for each model, a user must be able to add labels to it (for example for text classification). For simplicity, let’s assume that we have two labels: label_one and label_two.

In order to store this, I was thinking of creating a table called model_data.

In this table, I would store each model's data as JSON in the data column. Something like:

{
   "rows":[
      {
         "text":"Some text here",
         "labels":[
            {
               "label_one":true,
               "label_two":false
            }
         ]
      },
      {
         "text":"Another text here",
         "labels":[
            {
               "label_one":true,
               "label_two":false
            }
         ]
      },
      {
         "text":"tomato",
         "labels":[
            {
               "label_one":false,
               "label_two":true
            }
         ]
      },
      {
         "text":"strawberry",
         "labels":[
            {
               "label_one":false,
               "label_two":true
            }
         ]
      },
      {
         "text":"text added at a later date",
         "labels":[
            {
               "label_one":false,
               "label_two":true
            }
         ]
      }
   ]
}
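To make the append step concrete, here is a minimal Python sketch of extending such a document. The function name `append_rows` and the JSON string standing in for the contents of the data column are assumptions for illustration; no real database is involved, and new rows are simply given all-false labels.

```python
import json


def append_rows(document_json, texts, label_names):
    """Return a new JSON string with one unlabeled row appended per text."""
    # The whole stored document has to be parsed before it can be extended.
    document = json.loads(document_json)
    for text in texts:
        document["rows"].append({
            "text": text,
            # Same shape as the example above: a list holding one object of flags.
            "labels": [{name: False for name in label_names}],
        })
    # The whole document is serialized again before being written back.
    return json.dumps(document)


labels = ["label_one", "label_two"]

# Hypothetical initial upload, followed by a later extension.
stored = json.dumps({"rows": []})
stored = append_rows(stored, ["Some text here", "another text here"], labels)
stored = append_rows(stored, ["text added at a later date"], labels)

row_count = len(json.loads(stored)["rows"])
print(row_count)
```

Every append in this sketch parses and re-serializes the entire document, so the cost of adding even a single row grows with the size of the dataset.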

Is this even a feasible way to store such data? Datasets for ML tasks are often large, with thousands of rows or more.

Alternative

I have also considered just using the filesystem and storing the data for each model in CSV files. Whenever a user uploads some data (or extends it), the data would be added to a new CSV file.
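A minimal Python sketch of one variant of this idea, using only the standard library. It assumes a single CSV file per model that is appended to in place, rather than a new file per upload. The helper name `append_to_csv`, the file name `model_1.csv`, and the temporary directory are all hypothetical.

```python
import csv
import os
import tempfile


def append_to_csv(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only on first use."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# A temporary directory stands in for wherever model data would live.
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "model_1.csv")

# Initial upload, then a later extension of the same dataset.
append_to_csv(path, [{"comment": "Some text here"},
                     {"comment": "another text here"}], ["comment"])
append_to_csv(path, [{"comment": "text added at a later date"}], ["comment"])

with open(path, newline="", encoding="utf-8") as f:
    stored_rows = list(csv.DictReader(f))
print(len(stored_rows))
```

Appending here only writes the new rows, but the sketch assumes later uploads use the same column names as the first one. Labels would still need to be stored somewhere, for example as extra columns.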
