Data Wrangling: General Ledger as CSV: Import and data types

The first step of an audit is to get the financial information you are auditing from your client. At the very least this comprises a set of financial statements and a trial balance. Preferably your client will also give you a data dump of the full general ledger. At some point you will get other supporting schedules like aged debt, fixed asset registers and the like. We'll get to them in due course.

To get us up and running, we're going to rule out working with client documents that come as PDFs, because they require additional steps to get into R. Get your client into the habit of giving you documents in a file format you can work with immediately!

The two main file formats we are going to look at are CSV and XLSX. These should cover 99% of audits.

Today is all about CSVs. A CSV (Comma Separated Value) file is a common export document type for data from financial reporting software. It stores data in a tabular structure in a plain text format. This means it can be opened and viewed in a text editor. You would see that each value is separated by a comma (or another similar delimiter, like a semicolon). If you open the same document in a spreadsheet application like Excel, the software knows to put each value into its own cell based on those delimiters. Remember that some countries use a comma instead of a decimal point (or period) and vice versa, which is why you may need a delimiter that isn't a comma.
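For illustration (the column names here are invented), the same three rows might look like this comma-delimited, and then semicolon-delimited as exported in a locale that uses the decimal comma:

```
ACCOUNT,DESCRIPTION,AMOUNT
4000,Sales,1234.56
5000,Rent,-850.00
```

```
ACCOUNT;DESCRIPTION;AMOUNT
4000;Sales;1234,56
5000;Rent;-850,00
```

(readr, which we meet shortly, provides read_csv2() specifically for the semicolon/decimal-comma variant.)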


Key characteristics of .csv files:

- CSV files are plain text files, which means they can be easily opened and edited with any text editor, including Notepad, TextEdit, or Vim. This makes them easy to read and write using programming languages such as Python, R, and Java.

- CSV files organize data into rows and columns. Each row represents a record or item, while each column represents a field or attribute.

- CSV files often include a header row at the beginning of the file that describes the columns in the data set.

- There is no compression or optimization involved in the CSV file format. They can therefore be much bigger than XLSX files.

- No formatting, formulas, macros or other fancy details are retained in the file. A CSV file only contains the underlying data.

- You can open a CSV file in a text editor and it will be comprehensible to you due to the lack of extra information about formatting etc. An XLSX file is really a zipped bundle of XML files, so opened the same way it is mostly unreadable.

And maybe most importantly:

- CSV files do not store data type information, which means that all values are treated as strings. If you need to perform calculations or use specific data types, you will need to convert the data to the correct type after reading the file.


All this means that CSV files are eminently usable for our purposes, but we need to be mindful that we must specify the data types before analysing the data, and that CSV files can be very large.
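readr's parse_* helpers do exactly this character-to-type conversion, one vector at a time; a minimal sketch:

```r
library(readr)

# Values arrive as character; parse_* converts them to proper types
amount <- parse_double("1234.56")     # a double
count  <- parse_integer("42")         # an integer
when   <- parse_date("2022-03-31")    # a Date

# Anything unparseable becomes NA, with a warning describing the failure
bad <- parse_double("not a number")
```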

It also means that an erroneously placed delimiter (like a comma in a transaction description field) can throw everything askew!


Reading a CSV file

We will be using readr for wrangling CSV files. readr comes as part of the tidyverse aggregation of packages. It's an improvement over the base R functions for three reasons:

- It's faster

- It produces tibbles

- It is more reproducible (this one is most important to auditors!) because some base R functions inherit behaviour from your operating system and locale, so what happens on one system might not happen on another.
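As a quick taste of the tibble point (the column names here are made up), read_csv can read from a literal string wrapped in I() as well as from a file, and what comes back is a tibble:

```r
library(readr)

# Reading inline text instead of a file; I() marks it as literal data
tb <- read_csv(I("account,amount\n4000,100.5\n5000,-42"))

# tb is a tibble with 2 rows; both columns are guessed as doubles
tb
```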


Comprehensive documentation on readr is available at https://r4ds.had.co.nz/data-import.html and https://www.rdocumentation.org/packages/readr/versions/2.1.4 or in R using ?readr.


To import a CSV, use the read_csv function, specifying the location of the CSV relative to your current folder, then specifying the data types for your columns. You have two options: either specify your data types in the read_csv call, or let read_csv guess the data types and then fix any that weren't correctly identified.


For the former, you could use:

Ledger1 <- read_csv("ledger_fy22_qtr1.csv", col_types = cols(.default = "c"), na = "NA")

which would read every column in as character and treat any value of "NA" in the file as missing. For the latter:

Ledger1 <- read_csv("ledger_fy22_qtr1.csv")

where read_csv will guess what is in every column and you will then need to use the parse_*() functions to amend anything it got wrong.

The latter is easier because once the CSV is read into R you can look at it and work out what has been mapped incorrectly. However, the former is probably best practice (albeit more iterative) because your code will be cleaner and because it sidesteps potential problems when converting between data types.
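For example, if read_csv guessed POSTED_TOTAL_AMT as character (perhaps because the export includes thousands separators), you could re-parse it after the fact. The two-row tibble here is a stand-in for a real ledger:

```r
library(tidyverse)

# Stand-in for a ledger whose amount column was read in as character
Ledger1 <- tibble(POSTED_TOTAL_AMT = c("1,234.56", "-99.10"))

# parse_number() strips grouping marks like the comma and converts to double
Ledger1 <- Ledger1 %>%
  mutate(POSTED_TOTAL_AMT = parse_number(POSTED_TOTAL_AMT))
```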

So what if you wanted to specify the data type of certain columns as something other than character? You could include this:

Ledger2 <- read_csv("ledger_fy22_qtr2.csv", col_types = cols(.default = "c", POSTED_TOTAL_AMT = "d"), na = "NA")

And this would specify every column as type "character" except for POSTED_TOTAL_AMT, which is designated as "double" (a double is a number that can include a decimal point; cf. an integer, which is a whole number - rarely will an amount posted in a ledger be a whole number). Note that read_csv has a single-letter designation for each data type: "c" = character, "i" = integer, "d" = double, "n" = number, "l" = logical, "D" = date, "T" = datetime, "t" = time, "?" = guess, and "_" or "-" = skip the column.

This is quite acceptable if we have a single outlier and we want to make it very clear that we have made an exception for it. However, often a GL will have multiple columns (e.g. Debit, Credit, Date) whose data types we want to set, and the above approach would become messy very quickly. Instead, we could define all the column types in a single string and then pass that string to read_csv, like this:

GLColTypes <- c("ccccccccccccccccccccccccccdccccccccc")

Ledger3 <- read_csv("ledger_fy22_qtr3.csv", col_types = GLColTypes, na = "NA")

Here, col_types takes the string in GLColTypes and assigns each letter as the data type of the corresponding column, in order: 36 letters matching the 36 columns in our dataset.

This works but it is rather inelegant. So let's use the stringr package to make this a bit tidier.

stringr is a package we can use to manipulate strings. A string can be anything from a single letter to a word, a sentence or more, or a single number or multiple numbers. The important aspect is that, whether your string is letters or numbers or both, it's considered to be of data type 'character' (not 100% accurate but we'll get to that later!).

stringr, like readr, is part of the tidyverse suite of packages. Lots of info is available at https://r4ds.had.co.nz/strings.html and https://www.rdocumentation.org/packages/stringr/versions/1.5.0 or in R using ?stringr.

We will explore other functions in stringr as time goes on, but for now all we need is the str_c function:


GLColTypes <- c(rep("c", 26), "d", rep("c", 8))

GLColTypes <- str_c(GLColTypes, collapse = "")

The first line creates a vector called GLColTypes which is made up of 26 repetitions of "c", 1 of "d", then a further 8 of "c". If we printed it in R it would look like this:

> GLColTypes

 [1] "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c"

[23] "c" "c" "c" "c" "d" "c" "c" "c" "c" "c" "c" "c" "c"

The problem is that col_types expects a single string of letters, one letter per column. What we have at this point is a vector of separate one-character strings, which col_types cannot use directly.

What we want to do is collapse the whole thing down into a single string, and we do that with the second line. It applies the str_c() function to GLColTypes with the argument collapse = "", which tells it to join the separate strings of GLColTypes into one string with nothing in between them.

So now when we print GLColTypes, we get:

> GLColTypes

[1] "ccccccccccccccccccccccccccdcccccccc"


col_types can understand this, so we can now call GLColTypes when reading in our ledger part and read_csv will set the column data types as we have defined them.

This final solution is also a lot easier to check for bugs. It's much easier to count 26 + 1 + 8 = 35. Wait, 35? But our GL has 36 columns!

str_length(GLColTypes)

[1] 35

Sure enough, we have made a mistake and only defined 35 columns in GLColTypes. So what happened to column 36? Well, read_csv did what it normally does and guessed the column's data type by looking at the first 1,000 rows of the dataset. In this case, column 36 is called "ROWID" and comprises a series of unique row IDs made up of alphanumeric strings, so read_csv guessed it was type "character" (which is correct). Lucky for us!
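Rather than rely on luck, we can define all 36 columns explicitly; the fix is one more repetition of "c":

```r
library(stringr)

# 26 + 1 + 9 = 36: one type letter per column this time
GLColTypes <- c(rep("c", 26), "d", rep("c", 9))
GLColTypes <- str_c(GLColTypes, collapse = "")

str_length(GLColTypes)
# [1] 36
```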

You repeat this process for each part of the ledger you need to load (let's assume four, one for each quarter of the financial period).

After all that, our code is simply:

library(tidyverse)

GLColTypes <- c(rep("c", 26), "d", rep("c", 9))
GLColTypes <- str_c(GLColTypes, collapse = "")

Ledger1 <- read_csv("ledger_fy22_qtr1.csv", col_types = GLColTypes, na = "NA")
Ledger2 <- read_csv("ledger_fy22_qtr2.csv", col_types = GLColTypes, na = "NA")
Ledger3 <- read_csv("ledger_fy22_qtr3.csv", col_types = GLColTypes, na = "NA")
Ledger4 <- read_csv("ledger_fy22_qtr4.csv", col_types = GLColTypes, na = "NA")


However, we've learnt some skills which we can apply to reading in our other documentation. Next time we'll look at joining these sections together and doing some continuity checks.
