This page illustrates the use of the package dataresqc
by means of practical examples. In the first part, a daily data series is downloaded from Climate Explorer, converted to the Station Exchange Format, analysed visually, tested with the automatic tests, and finally flagged. In the second part, a recently digitized sub-daily pressure series for Rosario de Santa Fe is analyzed.
For instructions on how to install dataresqc
, see here.
In this example we will look for quality problems in a daily maximum temperature series.
First of all, open R and load dataresqc
using:
library(dataresqc)
Then choose a working directory where the output will be saved (if necessary, you can create a new directory with dir.create
):
dir.create("my_path/my_dir")
setwd("my_path/my_dir")
Now open Climate Explorer. Select "maximum temperature" in the "GHCN-D"" panel, write the name of a city in the "Select" panel, and click on "Get stations":
You should now get one or more entries of records for the required city (if none appears, try a different city). Click then on "get data":
In the next window, click on "raw data":
Finally, copy the url of the data file from the address bar of the browser:
Paste the url into the following command in R (do not forget the quotes):
climexp_to_sef("<url>", outpath = getwd())
This command will create a SEF file in your working directory containing the data for the city that you selected. The name of the file is given automatically by the software and includes: source ID, station ID, start date, end date, and variable code.
Assign the name of the file to the variable myfile
, so that you do not have to remember it later (in this example the filename is that of the Buenos Aires station, you will have a different filename):
myfile <- "GHCN-D_AR000875850_19080902-20190331_Tx.tsv"
Now you are ready to apply any of the functions included in dataresqc
to the data that you downloaded.
Let us start with a simple plot. For that, you can use the function plot_daily
:
plot_daily(myfile, outfile = "Tx.pdf")
#> Plot saved in Tx.pdf
The plot will be saved in the file 'Tx.pdf' in your working directory and will look like this:
One panel is plotted for each year. You can plot more than one year per panel, for instance three, using the parameter len (here we give a different name to the file to avoid overwriting the old one):
plot_daily(myfile, len = 3, outfile = "Tx3.pdf")
#> Plot saved in Tx3.pdf
It is also possible to change most of the graphical parameters. For example, to make the points smaller and change their color:
plot_daily(myfile, len = 3, cex = 0.5, col = "blue", outfile = "Tx_blue.pdf")
#> Plot saved in Tx_blue.pdf
To see which other parameters you can change, call the function documentation with:
?plot_daily
Now we can look at the actual reporting resolution of the data, by using the function plot_decimals
:
plot_decimals(myfile, outfile = "decimals.pdf")
#> Warning: Removed 10 rows containing missing values (`position_stack()`).
#> Warning: Removed 140 rows containing missing values (`position_stack()`).
#> Plot saved in decimals.pdf
In this example, certain decimals (.2, .4, .7, .9) are much less frequent than the others, suggesting that the actual resolution of the data is somewhat coarser than 0.1°C (this probably involved multiple roundings and conversions).
The function climatic_outliers
looks for values in the tails of the distribution. You can set the threshold that defines the outliers as a multiple of the interquartile range. For example:
climatic_outliers(myfile, outpath = getwd(), IQR=2)
#> Climatic outliers test completed
will return all values that are more than 2 interquartile ranges smaller than the 25th percentile or more than 2 interquantile ranges larger than the 75th percentile (the test is done for each month separately).
If any value that exceeds these thresholds is found, a text file starting with 'qc_' will be created in the working directory. It will look like this:
Var Year Month Day Value Test
Tx 1918 7 30 28.9 climatic_outliers
Tx 1940 3 25 13.8 climatic_outliers
Tx 1944 9 29 34 climatic_outliers
Tx 1946 9 11 32.3 climatic_outliers
Tx 1951 6 14 28.5 climatic_outliers
Tx 1957 1 29 43.3 climatic_outliers
Tx 1965 6 19 27.6 climatic_outliers
Tx 1979 7 29 30.2 climatic_outliers
Tx 1979 7 30 28.3 climatic_outliers
Tx 1979 7 31 29.5 climatic_outliers
Tx 1993 2 8 15.9 climatic_outliers
Tx 1993 2 9 16 climatic_outliers
Tx 1995 9 6 32.6 climatic_outliers
Tx 1996 8 24 33.7 climatic_outliers
Tx 1997 7 28 29.4 climatic_outliers
Tx 1997 7 29 29.3 climatic_outliers
Tx 2002 8 25 32 climatic_outliers
Tx 2013 9 11 35.3 climatic_outliers
Tx 2014 10 28 35.6 climatic_outliers
On request, this function also produces a boxplot and saves it into a pdf:
climatic_outliers(myfile, outpath = getwd(), IQR = 2, bplot = TRUE,
outfile = "boxplot.pdf")
#> Climatic outliers test completed
Again, it is possible to change most graphical parameters:
climatic_outliers(myfile, outpath = getwd(), IQR = 2, bplot = TRUE,
col = "red", lwd = 2, outfile = "boxplot_red.pdf")
#> Climatic outliers test completed
The function qc
automatically applies all possible quality tests. Its use is very simple:
qc(myfile, outpath = getwd())
#>
#> ----------------
#> AR000875850
#> ----------------
#>
#> [ Tx (daily) ]
#> Climatic outliers test completed
#> Temporal coherence test completed
#> Repetition test completed
#> Duplicate dates test completed
#> Big errors test completed
#> Duplicate columns test completed
For a single maximum temperature series, it will apply five tests (climatic outliers, temporal coherence, repetitions, duplicate dates, big errors). Note that the thresholds cannot be customized when you use qc
.
The additional suspicious values found by the tests will be added to the text file that you already created with climatic_outliers
.
The last step of the quality control process is to add flags to the data file. This is done with the function write_flags
:
write_flags(myfile, "qc_AR000875850_Tx_daily.txt", outpath = getwd(),
note = "qc")
#> Data written to file D:/Lavoro/Copernicus/dataresqc/vignettes/GHCN-D_AR000875850_19080902-20190331_Tx_qc.tsv
Here 'qc_AR000875850_Tx_daily.txt' must be replaced by the name of your text file containing the table of suspicious values. A new SEF file is created in the working directory (outpath = "."
) and its name will be the same of the input file with an additional '_qc' at the end (set by the parameter note to avoid overwriting the input file).
If you want, you can remove lines from the table of suspicious values, if you think they are false alarms. Or you can add new lines based on the plots or other forms of data inspection (remember that the columns must be separated by tabs, not spaces). Keep in mind that the last column in the table will become the flag description in the SEF file.
If everything went fine, when you open the new SEF file you should find instances like this:
1918 7 29 day 23.8
1918 7 30 day 28.9 qc=climatic_outliers
1918 7 31 day 15.3
Moreover, the name and version of the software that was used is printed in the header.
The internal consistency test for maximum and minimum temperature simply checks that the minimum temperature is not larger than the maximum temperature. For this test, you will need a minimum temperature series for the same station you selected before. You can get the minimum temperature series by repeating the procedure described in Section 1.1 and selecting 'minimum temperature' in the first page. Then:
climexp_to_sef("<url>", outpath = getwd())
myfile2 <- "GHCN-D_AR000875850_19080902-20190331_Tn.tsv"
Now you only need to use the function internal_consistency
:
internal_consistency(c(myfile, myfile2), outpath = getwd())
#> Internal consistency test completed
If no inconsistent values are found, then nothing will happen. Otherwise, the inconsistent values will be added to the usual text file, plus a new file will be created for the minimum temperature.
Download a precipitation series from Climate Explorer; try the function plot_weekly_cycle
(remember: you can check the function documentation with ?plot_weekly_cycle
, or you can have a look at the wiki)
Download a snowfall and a snow depth series for the same station from Climate Explorer; try to run the internal consistency test (function internal_consistency
).
The Station Exchange Format is not the only input format that dataresqc
can read. Data frames are also accepted; however, it is a little more complex to use them, because an additional data frame with metadata must be provided for most functions.
Recently digitized data for Bern (Switzerland) and Rosario de Santa Fe (Argentina) are stored in the variables Bern
and Rosario
, respectively. These variables are structured as lists, where every element of the list is a data frame containing one specific variable. We will analyze pressure in Rosario (1886-1900), so we will need the variable Rosario$p
.
To plot the data points we use the function plot_subdaily
:
plot_subdaily(Rosario$p, year = 1887, outfile = "1887.pdf")
#> Plot saved in 1887.pdf
This will plot only the year 1887 (one pdf per year is created), with one panel for each month (for simplicity, panels after March are omitted in the figure above). As usual, graphical parameters can be edited.
Similarly to the daily data, we can perform all available tests at once with the function qc
. Here, however, we must provide metadata (stored in the list Meta
) and indicate that the resolution of the data is sub-daily. Moreover, since the time is in UTC, we must provide an offset to be added to obtain local time (in this case minus 4 hours 17 minutes, i.e. the standard time in Argentina at the end of the 19th century).
All this is done by the following line:
qc(Rosario$p, Metadata = cbind(Meta$p[2,],"subdaily"), outpath = getwd(),
time_offset = -4.28)
#>
#> ----------------
#> Rosario
#> ----------------
#>
#> [ p (subdaily) ]
#> Climatic outliers test completed
#> Repetition test completed
#> Duplicate times test completed
#> WMO time consistency test completed
#> WMO gross errors test completed
#> Duplicate columns test completed
The resulting table of suspicious values should include 29 entries:
Var Year Month Day Hour Minute Value Test
p 1895 8 31 12 17 102.71 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1895 8 31 22 17 1031.64 wmo_time_consistency
p 1896 2 14 12 17 1012.04 wmo_time_consistency
p 1896 2 14 22 17 928.15 climatic_outliers;wmo_time_consistency
p 1896 3 8 12 17 1008.83 wmo_time_consistency
p 1896 3 8 22 17 76.9 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1896 3 16 12 17 1015.27 wmo_time_consistency
p 1896 3 16 22 17 1277.61 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1896 4 1 12 17 1278.44 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1896 4 1 22 17 1010.39 wmo_time_consistency
p 1896 4 2 12 17 1275.56 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1896 4 2 22 17 1008.15 wmo_time_consistency
p 1897 4 10 12 17 1279.96 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1897 4 10 22 17 1014.62 wmo_time_consistency
p 1898 2 1 12 17 29.4 climatic_outliers;wmo_time_consistency;wmo_gross_errors
p 1898 2 1 22 17 1000.27 wmo_time_consistency
p 1898 7 16 12 17 1027.23 wmo_time_consistency
p 1898 7 16 22 17 895.4 climatic_outliers;wmo_time_consistency
p 1898 12 9 12 17 1008.45 wmo_time_consistency
p 1898 12 9 22 17 1044.74 climatic_outliers;wmo_time_consistency
p 1898 12 10 12 17 999.96 wmo_time_consistency
p 1898 12 10 22 17 969.09 climatic_outliers;wmo_time_consistency
p 1898 12 31 12 17 1035.65 climatic_outliers;wmo_time_consistency
p 1898 12 31 22 17 1005.38 wmo_time_consistency
p 1899 2 3 12 17 1003.55 wmo_time_consistency
p 1899 2 3 22 17 604.98 climatic_outliers;wmo_time_consistency
p 1899 7 25 12 17 1008.59 wmo_time_consistency
p 1899 7 25 22 17 977.68 wmo_time_consistency
p 1900 5 20 12 17 973.42 climatic_outliers
Many of them are clearly related to digitization errors (typos) and can be corrected if the original data source is available.
We appreciate any feedback. You can use this anonymous Google Form or write an email to yuri.brugnara@gmail.com.