Big Data Research and SQL
Innocentius | PhD Candidate in Management Science

An SQL database is usually quite manageable when dealing with a small amount of data. When dealing with a huge volume of data during research, however, it is advisable not to use an SQL database, especially SQLite.

The Problem

The problem with an SQL database is that it is not designed with scaling in mind. In many cases, the time for a search operation grows as O(N), where N is the number of records already in the database. If processing each new file involves such searches, then the larger the database grows, the longer each operation takes, and the total time to process N files grows on the order of O(N^2).

Let’s suppose that processing one file takes 0.5 seconds, and that for every file already in the database the processing time increases by another 0.02 seconds.
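Under these assumptions, processing N files in total takes T(N) = 0.5 * N + 0.02 * (0 + 1 + … + (N − 1)) = 0.5N + 0.01N(N − 1) seconds, which grows quadratically: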

1 file = 0.5 s * 1 file + 0.02 s * (0 files already in the database) = 0.5 seconds

50 files = 0.5 s * 50 files + 0.02 s * (0 + 1 + 2 + … + 49) = 49.5 seconds

500 files = 0.5 s * 500 files + 0.02 s * (0 + 1 + … + 499) = 2,745 seconds

5,000 files = 0.5 s * 5,000 files + 0.02 s * (0 + … + 4,999) = 252,450 seconds
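To see where the O(N^2) comes from, here is a minimal Python sketch of this cost model; the 0.5 s and 0.02 s figures are the illustrative assumptions above, not measurements:

```python
def total_time(n_files, per_file=0.5, per_existing=0.02):
    # Fixed work per file, plus a penalty for every file already in the database.
    # sum(range(n_files)) = 0 + 1 + ... + (n_files - 1)
    return per_file * n_files + per_existing * sum(range(n_files))

for n in (1, 50, 500, 5000):
    print(n, total_time(n))  # 0.5, 49.5, 2745.0, 252450.0
```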

This quickly spirals out of control when you are dealing with a huge amount of data, so we need to find a database that scales up better.

Any workarounds?

We could split the data across several smaller SQLite databases instead of using just one.

This way, each operation costs more (because of the extra open, commit, and close steps), and it creates a ton of files on your hard drive, but the overall processing time comes down because the scaling problem is mitigated.

Let’s suppose that in this case processing one file takes 2 seconds, while the per-file scaling effect drops to 0.001 seconds.

1 file = 2 s * 1 file + 0.001 s * (0 files already in the database) = 2 seconds

50 files = 2 s * 50 files + 0.001 s * (0 + 1 + … + 49) = 101.225 seconds

500 files = 2 s * 500 files + 0.001 s * (0 + 1 + … + 499) = 1,124.75 seconds

5,000 files = 2 s * 5,000 files + 0.001 s * (0 + … + 4,999) = 22,497.5 seconds
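A minimal sketch of this sharding workaround in Python, assuming each record has a string key we can hash to choose a shard; the file names and the records table schema here are hypothetical:

```python
import sqlite3
import zlib

N_SHARDS = 16  # number of smaller databases; tune this to your data volume

def shard_path(key):
    # Use a stable hash (crc32) so the same key always maps to the same shard,
    # even across different Python runs.
    return f"shard_{zlib.crc32(key.encode()) % N_SHARDS}.sqlite"

def insert_record(key, payload):
    # Each call pays the open/commit/close overhead mentioned above,
    # but every shard stays small, so searches inside it stay fast.
    conn = sqlite3.connect(shard_path(key))
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, payload TEXT)")
    conn.execute("INSERT INTO records VALUES (?, ?)", (key, payload))
    conn.commit()
    conn.close()

def find_records(key):
    # Only the one shard that can contain this key has to be searched.
    conn = sqlite3.connect(shard_path(key))
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, payload TEXT)")
    rows = conn.execute(
        "SELECT payload FROM records WHERE key = ?", (key,)
    ).fetchall()
    conn.close()
    return rows
```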

Depending on how many files you need to process and on the size of your database, this method alone can reduce your processing time by a substantial constant factor.

Another possible workaround is to use the multi-core/multi-threaded capabilities of most modern hardware. SQLite normally allows only one writer at a time, effectively using a single CPU core, because simultaneous writes risk data loss. If adequate locking measures are implemented and enough cores are put to work, dramatic increases in performance can be observed.
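Rather than implementing fine-grained locks, one simple pattern is to parallelize only the expensive per-file processing and keep a single writer process, which sidesteps simultaneous writes entirely. A sketch of that pattern, where parse_file stands in for your own processing step and the file list and table schema are made up for illustration:

```python
import sqlite3
from multiprocessing import Pool

def parse_file(path):
    # Placeholder for the expensive per-file work
    # (parsing, cleaning, feature extraction, ...).
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return path, len(text)

def load_all(paths, db_path="results.sqlite", workers=8):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (path TEXT, length INTEGER)")
    # The heavy processing runs on several cores in parallel, but only this
    # one process ever writes to SQLite, so no extra locking is needed.
    with Pool(workers) as pool:
        for path, length in pool.imap_unordered(parse_file, paths):
            conn.execute("INSERT INTO results VALUES (?, ?)", (path, length))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_all(["file1.txt", "file2.txt"])  # hypothetical input files
```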

Concrete Solutions?

SQLite, as the name suggests, is designed to be light: it offers a highly efficient, all-in-one solution for smaller databases, but it performs horribly when a great amount of data is present. To minimize the scaling problem, we need to switch to another database structure.

(unfinished)