Author Message
cysun
Posts: 2935
Posted 22:49 Jul 15, 2020 |

Several students mentioned that it took a very long time to import a CSV file in Homework 2, so I think this is a good opportunity to talk about performance issues related to database operations.

Time Complexity Is Not Just Big O

You know that counting the operations in a program gives you its complexity, and that, for example, O(N) is much better than O(N^2). What you may not know is that operations come in different types, and some are much, much more expensive than others. If you count the wrong type of operations, your Big O is meaningless.

In a computer system, disk operations are orders of magnitude slower than in-memory operations. Exactly how much slower depends on many factors (e.g. the type of RAM and disk, random vs. sequential access, latency vs. bandwidth), but you can generally assume disk is at least 1000 times slower. In a program that mixes in-memory operations with random disk access, you can pretty much treat the in-memory operations as free and count only the disk operations.

Database operations usually involve disk access (because you don't want your bank to lose your account balance in a power outage). When the database server is remote, as in our case, database operations are even more expensive because each one also involves a network round trip between your computer and CS3. So when importing the CSV, the only thing that matters performance-wise is the number of database operations performed.

Making Database Operations Faster

1. Do fewer of them. The fewer database operations your program performs, the faster it runs.

2. One "large" query is faster than several "small" ones. For example, one "select * from cities where id=1" is much faster than one "select name from cities where id=1" plus one "select population from cities where id=1".
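
In EF Core terms (a sketch assuming a City entity like the one in the case study below), the difference looks like this:

```csharp
// Two round trips: each projection query is sent to the server separately.
var name = db.Cities.Where(c => c.Id == 1).Select(c => c.Name).Single();
var population = db.Cities.Where(c => c.Id == 1).Select(c => c.Population).Single();

// One round trip: fetch the whole row once, then read its fields
// in memory, which is essentially free.
var city = db.Cities.Single(c => c.Id == 1);
var cityName = city.Name;
var cityPopulation = city.Population;
```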

3. One "large" insert is faster than several "small" ones. Most database systems support bulk insert. For example, to insert two records into a table, instead of issuing two "insert into cities values (...)" statements, you can issue a single "insert into cities values (...), (...)".
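
With EF Core you don't write the insert statements yourself, but the same principle applies to how often you call SaveChanges(). A sketch (rows here is a hypothetical in-memory list parsed from the CSV):

```csharp
// Slow: every SaveChanges() is at least one round trip to the server.
foreach (var row in rows)
{
    db.Cities.Add(new City { Name = row.Name, Population = row.Population });
    db.SaveChanges();
}

// Faster: stage all new entities in memory, then let EF Core batch
// the inserts into far fewer statements in a single SaveChanges().
db.Cities.AddRange(rows.Select(r => new City { Name = r.Name, Population = r.Population }));
db.SaveChanges();
```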

4. Index. Database searches are fast not because of magic but because of indexes. When you create a table, most database systems will automatically create an index on the primary key, but if you need more indexes, you have to create them yourself. In our case, city names are unique, and you often need to search for a city by its name, so you definitely should create an index on city name. See https://docs.microsoft.com/en-us/ef/core/modeling/indexes for how to do that.
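
Following the EF Core docs linked above, an index on city name can be declared in OnModelCreating. A sketch (your DbContext and entity names may differ):

```csharp
public class ApplicationDbContext : DbContext
{
    public DbSet<City> Cities { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // City names are unique, so a unique index both enforces
        // uniqueness and makes lookups by name fast.
        modelBuilder.Entity<City>()
            .HasIndex(c => c.Name)
            .IsUnique();
    }
}
```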

Case Study

Consider the following code:

    var cityId = db.Cities.Where(c => c.Name == name).Select(c => c.Id).SingleOrDefault();
    if( cityId == 0 ){
        var city = new City { Name = name, Population = population };
        db.Cities.Add(city);
        db.SaveChanges();
    }
    cityId = db.Cities.Where(c => c.Name == name).Select(c => c.Id).SingleOrDefault();
    var city = db.Cities.Find(cityId);
    city.Records.Add( new Record(cases, tests, deaths) );
    db.SaveChanges();

I'll leave it as an exercise for you to identify which part of the code violates which rule of "Making Database Operations Faster". Keep in mind that the code above is executed once for each row in a CSV file, so every inefficiency is amplified 300 times.
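
For comparison after you've tried the exercise, here is one way the per-row work could be reduced. This is a sketch, not the official solution; it assumes the same City and Record entities, that the Records collection is initialized, and a hypothetical csvRows list parsed from the file:

```csharp
// One query up front: load all existing cities into a dictionary keyed
// by name, so per-row lookups happen in memory instead of on the server.
var citiesByName = db.Cities.ToDictionary(c => c.Name);

foreach (var row in csvRows)
{
    if (!citiesByName.TryGetValue(row.Name, out var city))
    {
        // New city: track it in both the context and the dictionary, but
        // don't save yet.
        city = new City { Name = row.Name, Population = row.Population };
        db.Cities.Add(city);
        citiesByName[row.Name] = city;
    }
    city.Records.Add(new Record(row.Cases, row.Tests, row.Deaths));
}

// One SaveChanges() at the end lets EF Core batch all the inserts.
db.SaveChanges();
```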