Volume 8, Issue 2 p. 154-198
DATA PAPER
Open Access

Regionally aggregated, stitched and de-drifted CMIP-climate data, processed with netCDF-SCM v2.0.0

Zebedee Nicholls

Corresponding Author

Australian–German Climate and Energy College, The University of Melbourne, Parkville, Vic., Australia

School of Earth Sciences, The University of Melbourne, Parkville, Vic., Australia

Correspondence

Zebedee Nicholls, Australian German Climate & Energy College, University of Melbourne, Parkville, Vic., 3010, Australia.

Email: [email protected]

Jared Lewis

Australian–German Climate and Energy College, The University of Melbourne, Parkville, Vic., Australia

Melissa Makin

Science IT, Faculty of Science, The University of Melbourne, Parkville, Vic., Australia

Usha Nattala

Melbourne Data Analytics Platform (MDAP), The University of Melbourne, Parkville, Vic., Australia

Geordie Z. Zhang

Melbourne Data Analytics Platform (MDAP), The University of Melbourne, Parkville, Vic., Australia

Simon J. Mutch

Melbourne Data Analytics Platform (MDAP), The University of Melbourne, Parkville, Vic., Australia

Edoardo Tescari

Melbourne Data Analytics Platform (MDAP), The University of Melbourne, Parkville, Vic., Australia

Malte Meinshausen

Australian–German Climate and Energy College, The University of Melbourne, Parkville, Vic., Australia

School of Earth Sciences, The University of Melbourne, Parkville, Vic., Australia

Potsdam Institute for Climate Impact Research (PIK), Member of the Leibniz Association, Potsdam, Germany

First published: 23 February 2021
Citations: 6

Dataset details

Identifier: https://doi.org/10.5281/zenodo.4536523

Creator: Zebedee Nicholls, Jared Lewis, Melissa Makin, Usha Nattala, Geordie Z. Zhang, Simon J. Mutch, Edoardo Tescari, Malte Meinshausen

Dataset correspondence: [email protected]

Title: Regionally aggregated, stitched and de-drifted CMIP-climate data, processed with netCDF-SCM v2.0.0

Publisher: Zenodo

Publication year: 2020

Resource type: Dataset

Version: 20210212

Abstract

The world's most complex climate models are currently running a range of experiments as part of the Sixth Coupled Model Intercomparison Project (CMIP6). Added to the output from the Fifth Coupled Model Intercomparison Project (CMIP5), the total data volume will be in the order of 20 PB. Here, we present a dataset of annual, monthly, global, hemispheric and land/ocean means derived from a selection of experiments of key interest to climate data analysts and reduced complexity climate modellers. The derived dataset is a key part of validating, calibrating and developing reduced complexity climate models against the behaviour of more physically complete models. In addition to its use for reduced complexity climate modellers, we aim to make our data accessible to other research communities. We facilitate this in a number of ways. Firstly, given the focus on annual, monthly, global, hemispheric and land/ocean mean quantities, our dataset is orders of magnitude smaller than the source data and hence does not require specialized ‘big data’ expertise. Secondly, again because of its smaller size, we are able to offer our dataset in a text-based format, greatly reducing the computational expertise required to work with CMIP output. Thirdly, we enable data provenance and integrity control by tracking all source metadata and providing tools which check whether a dataset has been retracted, that is, identified as erroneous. The resulting dataset is updated as new CMIP6 results become available and we provide a stable access point to allow automated downloads. Along with our accompanying website (cmip6.science.unimelb.edu.au), we believe this dataset provides a unique community resource, as well as allowing non-specialists to access CMIP data in a new, user-friendly way.

1 INTRODUCTION

Coupled atmosphere-ocean earth system models are our most comprehensive representations of the climate system. Earth system models from around the globe are currently running as part of the Sixth Coupled Model Intercomparison Project (CMIP6, Eyring et al. (2016a)). CMIP6 builds on the Fifth Coupled Model Intercomparison Project (CMIP5, Taylor et al. (2012)), the output of which is still widely used. Combined, these two projects represent our most comprehensive, physically based estimates of the coupled earth system's behaviour under a wide range of experiments.

However, CMIP5 and CMIP6 models are computationally expensive and hence cannot be run for all applications of interest. To fill this gap, a number of so-called reduced complexity climate models (also referred to as ‘simple climate models’) have been developed (Nicholls et al., 2020). An important part of developing such models is calibration: deriving a set of parameters that allows them to best replicate the behaviour of the more complex models that participate in the CMIP experiments. In order to do this calibration, one must first process the output from CMIP5 and CMIP6.

Once complete, the CMIP6 archive will be one of the world's largest data archives, with an expected total volume in the region of 18 PB (Balaji et al., 2018). Fortunately, reduced complexity climate modellers typically only require annual-mean or monthly model output on hemispheric and land/ocean scales. This significantly reduces the volume of data they must handle. Nonetheless, reduced complexity climate models typically include a number of different modules that cover the full range of the emissions-climate change cause-effect chain. Calibrating all of these different modules requires handling several datasets, such that the total volume of raw CMIP output of interest to a ‘comprehensive’ reduced complexity climate model will still be of the order of 50 TB. Processing a data volume of this size is an intimidating task, even for expert users.

The data processing is further complicated by five extra factors. The first is that the raw data are all in the custom, climate-specific netCDF data format (Unidata, 2020). Without specialist training, they cannot be read, let alone analysed. The second factor is that the data are all sorted according to a highly regularized data reference syntax (Balaji et al., 2018). Such regularization is required to make the data machine-processable; however, it can be confusing for non-experts. The third factor is that data are typically presented in absolute values. However, reduced complexity climate models are typically perturbation models; that is, they calculate perturbations from some reference state rather than absolute values. Given the data volume and the sometimes complex relationships between CMIP experiments, calculating such perturbations is not a trivial task. The fourth factor is licensing. All CMIP5 and CMIP6 files are released under a specific licence which users must adhere to, and retrieving this information is not easy. The final factor is retractions, that is, the removal of data that are later identified as erroneous. Such retractions are essential to avoid erroneous results propagating into the scientific literature. However, at the moment there are only a limited number of tools which allow a user to check whether they are using a retracted dataset or not.
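To make the third factor concrete, the following minimal Python sketch converts absolute values into perturbations from a reference state. It assumes, purely for illustration, that the reference state is the plain mean of a control run; the function name `to_perturbation` and all numbers are hypothetical and are not taken from netCDF-SCM, whose treatment of reference states and drift may well be more involved.

```python
from statistics import mean


def to_perturbation(experiment, control):
    """Return experiment values relative to the mean of a control run.

    Both arguments map year -> value. Using the plain control-run mean as
    the reference state is an assumption made for this sketch.
    """
    reference = mean(control.values())
    return {year: value - reference for year, value in experiment.items()}


# Hypothetical global-mean surface air temperatures in kelvin.
control = {1850: 287.0, 1851: 287.2, 1852: 286.8}
experiment = {1850: 287.5, 1851: 288.0, 1852: 288.5}

# Each entry is now a warming relative to the control-run mean.
anomalies = to_perturbation(experiment, control)
```

Even in this toy form, the experiment and its control run have to be matched up correctly before any subtraction can happen, which is where the relationships between CMIP experiments become important.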

The target audience for this dataset is reduced complexity climate models and modellers. Here, reduced complexity models refer to models which focus on global- and annual-mean properties of the climate system. As a result of their limited spatial and temporal resolution, reduced complexity models are very computationally efficient. These models are typically used when more complex models, such as those participating in CMIP, are too computationally expensive to use. For example, reduced complexity models are used within many integrated assessment models to assess the climate implications of different emissions pathways (given that integrated assessment models typically require hundreds to thousands of climate realizations, using CMIP models is not computationally feasible). Some prominent examples of reduced complexity models are MAGICC (Meinshausen et al., 2011), FaIR (Smith et al., 2018) and Hector (Hartin et al., 2015). A detailed discussion of reduced complexity models and an overview of models available in the literature can be found in the first phase of the Reduced Complexity Model Intercomparison Project (Nicholls et al., 2020).

Our CMIP5- and CMIP6-derived dataset was extracted using the open source tool we developed, netCDF-SCM (netCDF handling for reduced complexity/simple climate modellers, see Section 2.2) and is ready for use by reduced complexity climate modellers. It is the result of addressing all the complications described above and includes global, hemispheric, land/ocean annual and monthly means of a variety of CMIP5 and CMIP6 output. Given the processing performed, this dataset is orders of magnitude smaller than the original data. The reduced data volume means that we can feasibly provide the data in a text-based format. Thus, while the dataset is targeted at developers of reduced complexity climate models, its simple, text-based format also allows non-expert users beyond the climate science community to read and analyse the data as they no longer need to engage with the climate-specific netCDF format.

2 DATA DESCRIPTION AND DEVELOPMENT

2.1 Datasets

To produce this derived dataset, we rely on the CMIP5 and CMIP6 archives (available via esgf-node.llnl.gov/search/cmip5/ and esgf-node.llnl.gov/search/cmip6/, respectively, last accessed 25 June 2020), both of which rely on the Earth System Grid Federation (Williams et al., 2011; Cinquini et al., 2014; Williams et al., 2016). Accordingly, any use of our derived set must also follow the conditions of the original data source and users should cite the relevant work of each modelling group whose output they use. Full details for CMIP5 can be found at pcmdi.llnl.gov/mips/cmip5/terms-of-use.html (last accessed 25 June 2020), and for CMIP6 at pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html (last accessed 25 June 2020).

To date, we have processed data from over 400 different CMIP6 model-experiment pairs and over 125 CMIP5 model-experiment pairs (Tables 1 and 2). We have built a custom software package, netCDF-SCM (further details in Section 2.2), that can handle any dataset that can also be handled by Iris, ‘a powerful, format-agnostic, community-driven Python library for analysing and visualizing Earth science data’ (Met Office, 2019). In practice, this restricts netCDF-SCM to handling CF-compliant datasets (details of CF-compliance available at http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html, last accessed: 25 June 2020). Having said that, netCDF-SCM is focussed on CMIP data and much of its value is based on the data reference syntax associated with CMIP output. The CMIP6 data reference syntax is available at github.com/WCRP-CMIP/CMIP6_CVs (last accessed: 25 June 2020), whilst the CMIP5 data reference syntax is available at https://pcmdi.llnl.gov/mips/cmip5/docs/cmip5_data_reference_syntax.pdf (last accessed: 25 June 2020).
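As an illustration of why the data reference syntax matters for automated processing, the sketch below splits a CMIP6-style filename into its metadata fields. The field order follows the CMIP6 filename template (variable, table, source, experiment, member, grid label and an optional time range). The function `parse_cmip6_filename` and the example filename are hypothetical and are not part of netCDF-SCM.

```python
from pathlib import Path

# Filename components, in order, per the CMIP6 data reference syntax.
CMIP6_FILENAME_KEYS = (
    "variable_id",
    "table_id",
    "source_id",
    "experiment_id",
    "member_id",
    "grid_label",
)


def parse_cmip6_filename(filename):
    """Split a CMIP6-style filename into its data reference syntax parts."""
    parts = Path(filename).stem.split("_")
    n_keys = len(CMIP6_FILENAME_KEYS)
    if len(parts) not in (n_keys, n_keys + 1):
        raise ValueError(f"unexpected number of fields in {filename!r}")
    metadata = dict(zip(CMIP6_FILENAME_KEYS, parts))
    if len(parts) == n_keys + 1:
        # Time-dependent files carry a trailing time range; fixed fields do not.
        metadata["time_range"] = parts[-1]
    return metadata


# Illustrative filename only; it is not a file listed in this paper.
example = parse_cmip6_filename(
    "tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc"
)
```

Because every field sits at a fixed position, a tool can recover the model, experiment and ensemble member of a file without opening it. Splitting on underscores works here because CMIP6 identifiers such as `abrupt-4xCO2` use hyphens rather than underscores.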

TABLE 1. CMIP5 data used in this study
Modelling group Climate model Scenario Reference
CCCMA CanCM4 historical Canadian Centre For Climate Modelling And Analysis (CCCma) (2015a)
rcp45 Canadian Centre For Climate Modelling And Analysis (CCCma) (2015b)
CMCC CMCC-CM 1pctCO2 Scoccimarro and Gualdi (2014a)
historical Scoccimarro and Gualdi (2014b)
CMCC-CMS historical Centro Euro-Mediterraneo Sui Cambiamenti Climatici (CMCC) (2013a)
rcp45 Centro Euro-Mediterraneo Sui Cambiamenti Climatici (CMCC) (2013b)
rcp85 Centro Euro-Mediterraneo Sui Cambiamenti Climatici (CMCC) (2013c)
CNRM-CERFACS CNRM-CM5 1pctCO2 Sénési et al. (2014a)
abrupt4xCO2 Sénési et al. (2014b)
esmFdbk1 Sénési et al. (2014c)
historical Sénési et al. (2014d)
historicalExt Sénési et al. (2014e)
historicalGHG Sénési et al. (2014f)
historicalMisc Sénési et al. (2014g)
historicalNat Sénési et al. (2014h)
piControl Sénési et al. (2014i)
rcp26 Sénési et al. (2014j)
rcp45 Sénési et al. (2014k)
rcp85 Sénési et al. (2014l)
FIO FIO-ESM historical Qiao et al. (2013a)
rcp26 Qiao et al. (2013b)
rcp45 Qiao et al. (2013c)
rcp60 Qiao et al. (2013d)
rcp85 Qiao et al. (2013e)
GCESS BNU-ESM 1pctCO2 Ji et al. (2015a)
abrupt4xCO2 Ji et al. (2015b)
amip Ji et al. (2015c)
historical Ji et al. (2015d)
historicalGHG Ji et al. (2015e)
historicalMisc Ji et al. (2015f)
historicalNat Ji et al. (2015g)
piControl Ji et al. (2015h)
rcp26 Ji et al. (2015i)
rcp45 Ji et al. (2015j)
rcp85 Ji et al. (2015k)
IPSL IPSL-CM5A-LR 1pctCO2 Caubel et al. (2016a)
abrupt4xCO2 Foujols et al. (2016a)
esmFdbk1 Foujols et al. (2016b)
esmFixClim2 Bopp et al. (2016)
historical Denvil et al. (2016a)
historicalGHG Caubel et al. (2016b)
historicalMisc Caubel et al. (2016c)
historicalNat Caubel et al. (2016d)
piControl Caubel et al. (2016e)
rcp26 Denvil et al. (2016b)
rcp45 Denvil et al. (2016c)
rcp60 Denvil et al. (2016d)
rcp85 Denvil et al. (2016e)
LASG-CESS FGOALS-g2 1pctCO2 LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015a)
abrupt4xCO2 LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015b)
historical LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015c)
historicalGHG LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015d)
historicalMisc LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015e)
historicalNat LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015f)
piControl LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015g)
rcp26 LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015h)
rcp45 LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015i)
rcp85 LASG, Institute Of Atmospheric Physics, Chinese Academy Of Sciences (IAP-LASG) (2015j)
MIROC MIROC-ESM 1pctCO2 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015a)
abrupt4xCO2 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015b)
esmFixClim2 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015c)
esmHistorical Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015d)
esmrcp85 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015e)
historical Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015f)
historicalGHG Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015g)
historicalNat Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015h)
piControl Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015i)
rcp26 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015j)
rcp45 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015k)
rcp60 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015l)
rcp85 Japan Agency For Marine-Earth Science And Technology (JAMSTEC) et al. (2015m)
MOHC HadGEM2-ES abrupt4xCO2 Webb et al. (2014)
MPI-M MPI-ESM-LR 1pctCO2 Giorgetta et al. (2012a)
abrupt4xCO2 Giorgetta et al. (2012b)
esmFdbk1 Reick et al. (2012)
historical Giorgetta et al. (2012c)
piControl Giorgetta et al. (2012d)
rcp26 Giorgetta et al. (2012e)
rcp45 Giorgetta et al. (2012f)
rcp85 Giorgetta et al. (2012g)
NASA GISS GISS-E2-H 1pctCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2014a)
abrupt4xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2014b)
historical NASA Goddard Institute For Space Studies (NASA/GISS) (2014c)
historicalExt NASA Goddard Institute For Space Studies (NASA/GISS) (2014d)
historicalGHG NASA Goddard Institute For Space Studies (NASA/GISS) (2014e)
historicalMisc NASA Goddard Institute For Space Studies (NASA/GISS) (2014f)
historicalNat NASA Goddard Institute For Space Studies (NASA/GISS) (2014g)
piControl NASA Goddard Institute For Space Studies (NASA/GISS) (2014h)
rcp26 NASA Goddard Institute For Space Studies (NASA/GISS) (2014i)
rcp45 NASA Goddard Institute For Space Studies (NASA/GISS) (2014j)
rcp60 NASA Goddard Institute For Space Studies (NASA/GISS) (2014k)
rcp85 NASA Goddard Institute For Space Studies (NASA/GISS) (2014l)
NCAR CCSM4 1pctCO2 Meehl (2014i)
abrupt4xCO2 Meehl (2014j)
historical Meehl (2014a)
historicalGHG Meehl (2014h)
historicalMisc Meehl (2014f)
historicalNat Meehl (2014g)
piControl Gent (2014)
rcp26 Meehl (2014b)
rcp45 Meehl (2014c)
rcp60 Meehl (2014d)
rcp85 Meehl (2014e)
NCC NorESM1-M 1pctCO2 Bentsen et al. (2012a)
abrupt4xCO2 Bentsen et al. (2012b)
amip Bentsen et al. (2012c)
historical Bentsen et al. (2012d)
historicalExt Bentsen et al. (2012e)
historicalGHG Bentsen et al. (2012f)
historicalMisc Bentsen et al. (2012g)
historicalNat Bentsen et al. (2012h)
piControl Bentsen et al. (2011)
rcp26 Bentsen et al. (2012i)
rcp45 Bentsen et al. (2012j)
rcp60 Bentsen et al. (2012k)
rcp85 Bentsen et al. (2012l)
NOAA GFDL GFDL-CM3 1pctCO2 Horowitz et al. (2014a)
abrupt4xCO2 Horowitz et al. (2014b)
historical Horowitz et al. (2014c)
historicalGHG Horowitz et al. (2014d)
historicalMisc Horowitz et al. (2014e)
historicalNat Horowitz et al. (2014f)
rcp26 Horowitz et al. (2014g)
rcp45 Horowitz et al. (2014h)
rcp60 Horowitz et al. (2014i)
rcp85 Horowitz et al. (2014j)
GFDL-ESM2G 1pctCO2 Dunne et al. (2014a)
historicalMisc Dunne et al. (2014b)
GFDL-ESM2M 1pctCO2 Dunne et al. (2014c)
historicalGHG Dunne et al. (2014d)
historicalMisc Dunne et al. (2014e)
historicalNat Dunne et al. (2014f)

TABLE 2. CMIP6 data used in this study
Modelling group Climate model Scenario Reference
AWI AWI-CM-1-1-MR 1pctCO2 Semmler et al. (2018c)
abrupt-4xCO2 Semmler et al. (2018d)
historical Semmler et al. (2018e)
piControl Semmler et al. (2018f)
ssp126 Semmler et al. (2018a)
ssp245 Semmler et al. (2018b)
ssp370 Semmler et al. (2019a)
ssp585 Semmler et al. (2019b)
AWI-ESM-1-1-LR historical Danek et al. (2020)
BCC BCC-CSM2-MR 1pctCO2 Wu et al. (2018a)
1pctCO2-bgc Wu et al. (2019a)
1pctCO2-rad Wu et al. (2019b)
abrupt-4xCO2 Wu et al. (2018b)
esm-hist Wu et al. (2018c)
esm-piControl Wu et al. (2018d)
esm-ssp585 Wu et al. (2019c)
hist-GHG Wu et al. (2019e)
hist-aer Wu et al. (2019d)
hist-nat Wu et al. (2019f)
historical Wu et al. (2018e)
piControl Wu et al. (2018f)
ssp126 Xin et al. (2019a)
ssp245 Xin et al. (2019b)
ssp370 Xin et al. (2019c)
ssp585 Xin et al. (2019d)
BCC-ESM1 1pctCO2 Zhang et al. (2019c)
abrupt-4xCO2 Zhang et al. (2019d)
historical Zhang et al. (2018a)
piControl Zhang et al. (2018b)
ssp370 Zhang et al. (2019b)
ssp370-lowNTCF Zhang et al. (2019a)
CAMS CAMS-CSM1-0 1pctCO2 Rong (2019a)
abrupt-4xCO2 Rong (2019b)
historical Rong (2019c)
piControl Rong (2019d)
CAS CAS-ESM2-0 1pctCO2 Chai (2020c)
abrupt-4xCO2 Chai (2020d)
historical Chai (2020a)
piControl Chai (2020b)
FGOALS-f3-L 1pctCO2 Yu (2019a)
abrupt-4xCO2 Yu (2019b)
historical Yu (2019c)
piControl Yu (2019d)
CCCR-IITM IITM-ESM 1pctCO2 Raghavan and Panickal (2019a)
piControl Raghavan and Panickal (2019b)
CCCma CanESM5 1pctCO2 Swart et al. (2019h)
1pctCO2-bgc Swart et al. (2019e)
1pctCO2-rad Swart et al. (2019f)
abrupt-2xCO2 Cole et al. (2019)
abrupt-4xCO2 Swart et al. (2019i)
esm-hist Swart et al. (2019j)
esm-piControl Swart et al. (2019k)
esm-ssp585 Swart et al. (2019g)
hist-GHG Swart et al. (2019o)
hist-aer Swart et al. (2019n)
hist-nat Swart et al. (2019p)
hist-stratO3 Swart et al. (2019q)
historical Swart et al. (2019l)
piControl Swart et al. (2019m)
ssp119 Swart et al. (2019s)
ssp126 Swart et al. (2019t)
ssp245 Swart et al. (2019u)
ssp245-nat Swart et al. (2019r)
ssp370 Swart et al. (2019v)
ssp434 Swart et al. (2019w)
ssp460 Swart et al. (2019x)
ssp534-over Swart et al. (2019y)
ssp585 Swart et al. (2019z)
CanESM5-CanOE 1pctCO2 Swart et al. (2019b)
esm-hist Swart et al. (2019c)
esm-ssp585 Swart et al. (2019a)
historical Swart et al. (2019d)
piControl Swart et al. (2019aa)
ssp126 Swart et al. (2019ab)
ssp245 Swart et al. (2019ac)
ssp370 Swart et al. (2019ad)
ssp585 Swart et al. (2019ae)
CNRM-CERFACS CNRM-CM6-1 1pctCO2 Voldoire (2018a)
abrupt-0p5xCO2 Voldoire (2019g)
abrupt-2xCO2 Voldoire (2019h)
abrupt-4xCO2 Voldoire (2018b)
hist-GHG Voldoire (2019j)
hist-aer Voldoire (2019i)
hist-nat Voldoire (2019k)
historical Voldoire (2018c)
piControl Voldoire (2018d)
ssp126 Voldoire (2019l)
ssp245 Voldoire (2019m)
ssp370 Voldoire (2019n)
ssp585 Voldoire (2019o)
CNRM-CM6-1-HR 1pctCO2 Voldoire (2019a)
abrupt-4xCO2 Voldoire (2019b)
historical Voldoire (2019c)
piControl Voldoire (2019d)
ssp126 Voldoire (2020a)
ssp245 Voldoire (2019e)
ssp370 Voldoire (2020b)
ssp585 Voldoire (2019f)
CNRM-ESM2-1 1pctCO2 Seferian (2018c)
1pctCO2-bgc Seferian (2018a)
1pctCO2-rad Seferian (2018b)
abrupt-4xCO2 Seferian (2018d)
esm-hist Seferian (2019b)
esm-piControl Seferian (2019c)
historical Seferian (2018e)
piControl Seferian (2018f)
ssp119 Voldoire (2019p)
ssp126 Voldoire (2019q)
ssp245 Voldoire (2019r)
ssp370 Voldoire (2019s)
ssp370-lowNTCF Seferian (2019a)
ssp434 Voldoire (2019t)
ssp460 Voldoire (2019u)
ssp534-over Voldoire (2019v)
ssp585 Voldoire (2019w)
CSIRO ACCESS-ESM1-5 1pctCO2 Ziehn et al. (2019g)
1pctCO2-bgc Ziehn et al. (2019a)
1pctCO2-rad Ziehn et al. (2019b)
abrupt-4xCO2 Ziehn et al. (2019h)
esm-1pct-brch-1000PgC Ziehn et al. (2019c)
esm-1pct-brch-2000PgC Ziehn et al. (2019d)
esm-1pct-brch-750PgC Ziehn et al. (2019e)
esm-hist Ziehn et al. (2019i)
esm-piControl Ziehn et al. (2019j)
esm-ssp585 Ziehn et al. (2019f)
historical Ziehn et al. (2019k)
piControl Ziehn et al. (2019l)
ssp126 Ziehn et al. (2019m)
ssp245 Ziehn et al. (2019n)
ssp370 Ziehn et al. (2019o)
ssp585 Ziehn et al. (2019p)
CSIRO-ARCCSS ACCESS-CM2 1pctCO2 Dix et al. (2019a)
abrupt-4xCO2 Dix et al. (2019b)
historical Dix et al. (2019c)
piControl Dix et al. (2019d)
ssp126 Dix et al. (2019e)
ssp245 Dix et al. (2019f)
ssp370 Dix et al. (2019g)
ssp585 Dix et al. (2019h)
E3SM-Project E3SM-1-0 1pctCO2 Bader et al. (2019a)
abrupt-4xCO2 Bader et al. (2019b)
historical Bader et al. (2019c)
piControl Bader et al. (2018)
historical Bader et al. (2019d)
piControl Bader et al. (2019e)
E3SM-1-1-ECA historical Bader et al. (2020)
piControl Bader et al. (2019f)
EC-Earth-Consortium EC-Earth3 1pctCO2 EC-Earth Consortium (EC-Earth) (2019b)
abrupt-4xCO2 EC-Earth Consortium (EC-Earth) (2019c)
historical EC-Earth Consortium (EC-Earth) (2019d)
piControl EC-Earth Consortium (EC-Earth) (2019e)
ssp119 EC-Earth Consortium (EC-Earth) (2019f)
ssp126 EC-Earth Consortium (EC-Earth) (2019g)
ssp245 EC-Earth Consortium (EC-Earth) (2019h)
ssp585 EC-Earth Consortium (EC-Earth) (2019i)
EC-Earth3-LR piControl EC-Earth Consortium (EC-Earth) (2019a)
EC-Earth3-Veg 1pctCO2 EC-Earth Consortium (EC-Earth) (2019j)
abrupt-4xCO2 EC-Earth Consortium (EC-Earth) (2019k)
historical EC-Earth Consortium (EC-Earth) (2019l)
piControl EC-Earth Consortium (EC-Earth) (2019m)
ssp119 EC-Earth Consortium (EC-Earth) (2019n)
ssp126 EC-Earth Consortium (EC-Earth) (2019o)
ssp245 EC-Earth Consortium (EC-Earth) (2019p)
ssp370 EC-Earth Consortium (EC-Earth) (2019q)
ssp585 EC-Earth Consortium (EC-Earth) (2019r)
EC-Earth3-Veg-LR historical EC-Earth Consortium (EC-Earth) (2020a)
piControl EC-Earth Consortium (EC-Earth) (2020b)
FIO-QLNM FIO-ESM-2-0 1pctCO2 Song et al. (2020a)
abrupt-4xCO2 Song et al. (2020b)
historical Song et al. (2019a)
piControl Song et al. (2019b)
IPSL IPSL-CM6A-LR 1pctCO2 Boucher et al. (2018e)
1pctCO2-bgc Boucher et al. (2018a)
1pctCO2-rad Boucher et al. (2018b)
abrupt-0p5xCO2 Boucher et al. (2018c)
abrupt-2xCO2 Boucher et al. (2018d)
abrupt-4xCO2 Boucher et al. (2018f)
hist-GHG Boucher et al. (2018j)
hist-aer Boucher et al. (2018i)
hist-nat Boucher et al. (2018k)
hist-stratO3 Boucher et al. (2018l)
historical Boucher et al. (2018g)
piControl Boucher et al. (2018h)
ssp119 Boucher et al. (2019a)
ssp126 Boucher et al. (2019b)
ssp245 Boucher et al. (2019c)
ssp370 Boucher et al. (2019d)
ssp434 Boucher et al. (2019e)
ssp460 Boucher et al. (2019f)
ssp534-over Boucher et al. (2019g)
ssp585 Boucher et al. (2019h)
MIROC MIROC-ES2L 1pctCO2 Hajima et al. (2019a)
1pctCO2-bgc Hajima et al. (2019e)
1pctCO2-rad Hajima et al. (2019f)
abrupt-4xCO2 Hajima et al. (2019b)
esm-hist Hajima et al. (2020a)
esm-piControl Hajima et al. (2020b)
historical Hajima et al. (2019c)
piControl Hajima et al. (2019d)
ssp119 Tachiiri et al. (2019a)
ssp126 Tachiiri et al. (2019b)
ssp245 Tachiiri et al. (2019c)
ssp370 Tachiiri et al. (2019d)
ssp585 Tachiiri et al. (2019e)
MIROC6 1pctCO2 Tatebe and Watanabe (2018a)
abrupt-0p5xCO2 Ogura et al. (2019a)
abrupt-2xCO2 Ogura et al. (2019b)
abrupt-4xCO2 Tatebe and Watanabe (2018b)
hist-GHG Shiogama (2019b)
hist-aer Shiogama (2019a)
hist-nat Shiogama (2019c)
hist-stratO3 Shiogama (2019d)
historical Tatebe and Watanabe (2018c)
piControl Tatebe and Watanabe (2018d)
ssp119 Shiogama et al. (2019a)
ssp126 Shiogama et al. (2019b)
ssp245 Shiogama et al. (2019c)
ssp245-nat Shiogama (2019e)
ssp370 Shiogama et al. (2019d)
ssp370-lowNTCF Takemura (2019)
ssp434 Shiogama et al. (2019e)
ssp460 Shiogama et al. (2019f)
ssp534-over Shiogama et al. (2019g)
ssp585 Shiogama et al. (2019h)
MOHC HadGEM3-GC31-LL 1pctCO2 Ridley et al. (2019a)
abrupt-4xCO2 Ridley et al. (2019b)
hist-GHG Jones (2019g)
hist-aer Jones (2019f)
hist-nat Jones (2019h)
historical Ridley et al. (2019c)
piControl Ridley et al. (2018)
ssp126 Good (2020a)
ssp245 Good (2019)
ssp585 Good (2020b)
HadGEM3-GC31-MM 1pctCO2 Ridley et al. (2020a)
abrupt-4xCO2 Ridley et al. (2020b)
historical Ridley et al. (2019d)
piControl Ridley et al. (2019e)
UKESM1-0-LL 1pctCO2 Tang et al. (2019a)
1pctCO2-bgc Jones (2019a)
1pctCO2-rad Jones (2019b)
abrupt-4xCO2 Tang et al. (2019b)
esm-1pct-brch-1000PgC Jones (2020a)
esm-1pct-brch-2000PgC Jones (2019c)
esm-1pct-brch-750PgC Jones (2019d)
esm-hist Tang et al. (2019c)
esm-piControl Tang et al. (2019d)
esm-ssp534-over Jones et al. (2020)
esm-ssp585 Jones (2019e)
historical Tang et al. (2019e)
piControl Tang et al. (2019f)
ssp119 Good et al. (2019a)
ssp126 Good et al. (2019b)
ssp245 Good et al. (2019c)
ssp370 Good et al. (2019d)
ssp434 Good et al. (2019e)
ssp534-over Good et al. (2019f)
ssp585 Good et al. (2019g)
MPI-M MPI-ESM1-2-HR 1pctCO2 Jungclaus et al. (2019a)
abrupt-4xCO2 Jungclaus et al. (2019b)
historical Jungclaus et al. (2019c)
piControl Jungclaus et al. (2019d)
MPI-ESM1-2-LR 1pctCO2 Wieners et al. (2019e)
1pctCO2-bgc Brovkin et al. (2019a)
1pctCO2-rad Brovkin et al. (2019b)
abrupt-4xCO2 Wieners et al. (2019f)
esm-hist Wieners et al. (2019g)
esm-piControl Wieners et al. (2019h)
esm-ssp585 Brovkin et al. (2019c)
historical Wieners et al. (2019i)
piControl Wieners et al. (2019j)
ssp126 Wieners et al. (2019a)
ssp245 Wieners et al. (2019b)
ssp370 Wieners et al. (2019c)
ssp585 Wieners et al. (2019d)
MRI MRI-ESM2-0 1pctCO2 Yukimoto et al. (2019e)
1pctCO2-bgc Yukimoto et al. (2019b)
1pctCO2-rad Yukimoto et al. (2019c)
abrupt-0p5xCO2 Yukimoto et al. (2020a)
abrupt-2xCO2 Yukimoto et al. (2020b)
abrupt-4xCO2 Yukimoto et al. (2019f)
esm-hist Yukimoto et al. (2019g)
esm-piControl Yukimoto et al. (2019h)
esm-ssp585 Yukimoto et al. (2019d)
hist-GHG Yukimoto et al. (2019l)
hist-aer Yukimoto et al. (2019k)
hist-nat Yukimoto et al. (2019m)
hist-stratO3 Yukimoto et al. (2020c)
historical Yukimoto et al. (2019i)
piControl Yukimoto et al. (2019j)
ssp119 Yukimoto et al. (2019n)
ssp126 Yukimoto et al. (2019o)
ssp245 Yukimoto et al. (2019p)
ssp370 Yukimoto et al. (2019q)
ssp370-lowNTCF Yukimoto et al. (2019a)
ssp434 Yukimoto et al. (2019r)
ssp460 Yukimoto et al. (2019s)
ssp534-over Yukimoto et al. (2019t)
ssp585 Yukimoto et al. (2019u)
NASA-GISS GISS-E2-1-G 1pctCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2018b)
1pctCO2-bgc NASA Goddard Institute For Space Studies (NASA/GISS) (2019h)
1pctCO2-rad NASA Goddard Institute For Space Studies (NASA/GISS) (2019i)
abrupt-0p5xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019j)
abrupt-2xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2018a)
abrupt-4xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2018c)
hist-GHG NASA Goddard Institute For Space Studies (NASA/GISS) (2018g)
hist-aer NASA Goddard Institute For Space Studies (NASA/GISS) (2018f)
hist-nat NASA Goddard Institute For Space Studies (NASA/GISS) (2018h)
historical NASA Goddard Institute For Space Studies (NASA/GISS) (2018d)
piControl NASA Goddard Institute For Space Studies (NASA/GISS) (2018e)
ssp245 NASA Goddard Institute For Space Studies (NASA/GISS) (2020a)
ssp370 NASA Goddard Institute For Space Studies (NASA/GISS) (2020b)
ssp585 NASA Goddard Institute For Space Studies (NASA/GISS) (2020c)
GISS-E2-1-G-CC esm-1pctCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019a)
historical NASA Goddard Institute For Space Studies (NASA/GISS) (2019b)
piControl NASA Goddard Institute For Space Studies (NASA/GISS) (2019c)
GISS-E2-1-H 1pctCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019l)
abrupt-2xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019k)
abrupt-4xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019m)
historical NASA Goddard Institute For Space Studies (NASA/GISS) (2019n)
piControl NASA Goddard Institute For Space Studies (NASA/GISS) (2018i)
GISS-E2-2-G 1pctCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019e)
abrupt-2xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019d)
abrupt-4xCO2 NASA Goddard Institute For Space Studies (NASA/GISS) (2019f)
piControl NASA Goddard Institute For Space Studies (NASA/GISS) (2019g)
NCAR CESM2 1pctCO2 Danabasoglu (2019d)
1pctCO2-bgc Danabasoglu (2019c)
abrupt-0p5xCO2 Danabasoglu (2020c)
abrupt-2xCO2 Danabasoglu (2020d)
abrupt-4xCO2 Danabasoglu (2019e)
esm-hist Danabasoglu (2019f)
esm-piControl Danabasoglu (2019g)
hist-GHG Danabasoglu (2019i)
hist-aer Danabasoglu (2020e)
hist-nat Danabasoglu (2019j)
historical Danabasoglu (2019h)
piControl Danabasoglu et al. (2019)
ssp126 Danabasoglu (2019k)
ssp245 Danabasoglu (2019l)
ssp245-nat Danabasoglu (2020f)
ssp370 Danabasoglu (2019m)
ssp585 Danabasoglu (2019n)
CESM2-FV2 1pctCO2 Danabasoglu (2020a)
abrupt-4xCO2 Danabasoglu (2020b)
historical Danabasoglu (2019a)
piControl Danabasoglu (2019b)
CESM2-WACCM 1pctCO2 Danabasoglu (2019r)
abrupt-4xCO2 Danabasoglu (2019s)
historical Danabasoglu (2019t)
piControl Danabasoglu (2019u)
ssp126 Danabasoglu (2019v)
ssp245 Danabasoglu (2019w)
ssp370 Danabasoglu (2019x)
ssp370-lowNTCF Danabasoglu (2019q)
ssp534-over Danabasoglu (2019y)
ssp585 Danabasoglu (2019z)
CESM2-WACCM-FV2 1pctCO2 Danabasoglu (2020g)
abrupt-4xCO2 Danabasoglu (2020h)
historical Danabasoglu (2019o)
piControl Danabasoglu (2019p)
NCC NorCPM1 1pctCO2 Bethke et al. (2019a)
abrupt-4xCO2 Bethke et al. (2019b)
historical Bethke et al. (2019c)
piControl Bethke et al. (2019d)
NorESM1-F piControl Guo et al. (2019)
NorESM2-LM 1pctCO2 Seland et al. (2019a)
1pctCO2-rad Schwinger et al. (2020)
abrupt-4xCO2 Seland et al. (2019b)
esm-1pct-brch-1000PgC Schwinger et al. (2019)
esm-hist Seland et al. (2019c)
esm-piControl Seland et al. (2019d)
hist-GHG Seland et al. (2019h)
hist-aer Seland et al. (2019g)
hist-nat Seland et al. (2019i)
historical Seland et al. (2019e)
piControl Seland et al. (2019f)
NorESM2-MM 1pctCO2 Bentsen et al. (2019a)
abrupt-4xCO2 Bentsen et al. (2019b)
historical Bentsen et al. (2019c)
piControl Bentsen et al. (2019d)
NIMS-KMA KACE-1-0-G historical Byun et al. (2019)
UKESM1-0-LL historical Byun (2020)
NOAA-GFDL GFDL-CM4 1pctCO2 Guo et al. (2018a)
abrupt-4xCO2 Guo et al. (2018b)
historical Guo et al. (2018c)
piControl Guo et al. (2018d)
GFDL-ESM4 1pctCO2 Krasting et al. (2018d)
1pctCO2-bgc Krasting et al. (2018a)
1pctCO2-rad Krasting et al. (2018b)
abrupt-4xCO2 Krasting et al. (2018e)
esm-hist Krasting et al. (2018f)
esm-piControl Krasting et al. (2018g)
esm-ssp585 Krasting et al. (2018c)
hist-GHG Horowitz et al. (2018b)
hist-aer Horowitz et al. (2018a)
hist-nat Horowitz et al. (2018c)
historical Krasting et al. (2018h)
piControl Krasting et al. (2018i)
ssp119 John et al. (2018a)
ssp126 John et al. (2018b)
ssp245 John et al. (2018e)
ssp370 John et al. (2018c)
ssp585 John et al. (2018d)
NUIST NESM3 1pctCO2 Cao and Wang (2019a)
abrupt-4xCO2 Cao and Wang (2019b)
historical Cao and Wang (2019c)
piControl Cao and Wang (2019d)
SNU SAM0-UNICON 1pctCO2 Park and Shin (2019a)
abrupt-4xCO2 Park and Shin (2019b)
historical Park and Shin (2019c)
piControl Park and Shin (2019d)
THU CIESM 1pctCO2 Huang (2020a)
abrupt-4xCO2 Huang (2020b)
historical Huang (2019a)
piControl Huang (2019b)
ssp126 Huang (2019c)
ssp585 Huang (2020c)

Here, we comply with the CMIP5 and CMIP6 conditions of use by providing a summary of the data sources we have processed to date in Tables 1 and 2 and including the required acknowledgement statements. Tables 1 and 2 have been produced automatically from the source data using netCDF-SCM (documentation at netcdf-scm.readthedocs.io/en/latest/usage/using-cmip-data.html, last accessed 25 June 2020). In addition, we provide tools to check whether a dataset has been retracted (see Section 2.2.3). We hope this automation will be useful to other researchers and will enhance existing CMIP6 functionality, given our extensive efforts to follow ‘CMIP6 data management and integrity objectives’ (Balaji et al., 2018).

2.2 Methods

In principle, our derived dataset is easy to produce. It consists of weighted means of different horizontal area sections of the raw data, which are then joined together to create timeseries that span both the historical and future scenario simulations. In practice, collecting, joining and processing the huge data volumes make it a non-trivial effort.

To address this in an automated way, we have developed the netCDF-SCM package (gitlab.com/netcdf-scm/netcdf-scm), available under the Open Source Initiative approved BSD-3-Clause licence (see https://opensource.org/licenses and Morin et al. (2012) for more information on open source software licences for scientists). The netCDF-SCM package builds on the Iris package (Met Office, 2019) and relies heavily on the numerous capabilities Iris offers. We welcome the addition of new features to the netCDF-SCM package, whose development is hosted at the same address.

We have validated netCDF-SCM in a number of ways. Firstly, we have compared our global-mean calculations with the limited set of data available on the KNMI climate explorer (climexp.knmi.nl/start.cgi, last accessed 7th October 2020) and found them to be within 0.1% in all cases. Secondly, we have run an extensive test suite over the entire netCDF-SCM package (the test suite is run with every update to the software in a process known as ‘continuous integration’, see, e.g., https://about.gitlab.com/blog/2018/01/22/a-beginners-guide-to-continuous-integration/). This test suite includes unit tests (tests of individual bits of functionality in isolation from the rest of the package), integration tests (end-to-end tests of the behaviour of the package) and regression tests (tests of changes in outputs of the package compared to previous versions). Combined, these tests cover 98% of the codebase, with much of the package being run many times over as part of the testing process.

2.2.1 Weighted means

We combine raw data, cell areas and cell surface fraction information to produce appropriately weighted means for each region of interest (Figure 1). We calculate the weighted means as follows:
$$v(r, t, Z) = \frac{\sum_{Y} X(t, Y, Z) \times A(Y, Z) \times f(r, Y, Z)}{\sum_{Y} A(Y, Z) \times f(r, Y, Z)} \tag{1}$$
where $v(r, t, Z)$ is the output for horizontal region of interest $r$ at time $t$ and non-horizontal spatial coordinate vector $Z$, $Y$ is the horizontal spatial coordinate vector of each cell, $X(t, Y, Z)$ is the raw data at time $t$ in the cell with horizontal spatial coordinate vector $Y$ and non-horizontal spatial coordinate vector $Z$, $A(Y, Z)$ is the horizontal area of the cell, and $f(r, Y, Z)$ is the relevant surface fraction of the cell for the region of interest. Here, ‘horizontal region’ refers to a region defined by latitude and longitude only.
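As an illustration only, Equation (1) can be sketched in Python with NumPy for a single level $Z$. This is not netCDF-SCM's implementation: the function name, the flattened cell axis and the toy numbers are all assumptions made for the example.

```python
import numpy as np


def weighted_regional_mean(data, cell_area, region_fraction):
    """Area- and surface-fraction-weighted mean over horizontal cells.

    data:            shape (time, cell), the raw values X
    cell_area:       shape (cell,), the horizontal cell areas A
    region_fraction: shape (cell,), the surface fraction f for one region
    """
    weights = cell_area * region_fraction
    total_weight = weights.sum()
    if total_weight == 0:
        raise ValueError("region has zero total weight")
    # Sum over the cell axis (the sum over Y in Equation (1))
    return (data * weights).sum(axis=-1) / total_weight


# Made-up example: three cells, two time steps
data = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
cell_area = np.array([1.0, 1.0, 2.0])
land_fraction = np.array([1.0, 0.5, 0.0])

land_mean = weighted_regional_mean(data, cell_area, land_fraction)
```

In this toy case the third cell contains no land, so it does not contribute to the land mean regardless of its area.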
Figure 1. Data processing workflow. Firstly, raw data are combined with cell area and surface fraction information to derive area weighted masks for each region. The region-specific masks are then combined with the raw CMIP6 data to derive weighted-mean timeseries for each region of interest (here NH stands for ‘Northern Hemisphere’ and SH stands for ‘Southern Hemisphere’). The resulting timeseries require a fraction of the original data's volume and hence are much easier to handle for non-experts.

The cell area and surface fractions specific to the climate model of interest are determined using the CMIP data reference syntax and the CMIP output metadata. If the cell area and/or surface fraction information is not available, netCDF-SCM will simply use inbuilt best-guess cell areas and surface fractions. The best-guess cell areas use Iris' (Met Office, 2019) functionality to calculate cell areas based on each cell's bounds (cell areas are calculated as $r^2 (\mathrm{lon}_1 - \mathrm{lon}_0)(\sin(\mathrm{lat}_1) - \sin(\mathrm{lat}_0))$, where $r$ is the Earth's radius, $\mathrm{lon}_1$ and $\mathrm{lon}_0$ are the longitude (east–west) bounds of the cell, and $\mathrm{lat}_1$ and $\mathrm{lat}_0$ are the latitude (north–south) bounds of the cell). The best-guess surface fractions are the result of interpolating the surface fractions from IPSL-CM6A-LR climate model historical simulations (Boucher et al., 2018g) onto a regular 1° × 1° grid. When calculating weighted averages, these best-guess surface fractions are then interpolated onto the model of interest's grid. This process works best with data on a regular grid. Given their complexity, on some models' native grids it may be necessary to obtain the cell area and surface fraction information before the data can be processed. We have examined the importance of the choice of best-guess surface fractions and found that basing our best-guess surface fractions on other climate models would make a negligible difference (<1%) to our results. In cases where the cell area and surface fraction are particularly important (e.g. for small regions), we stress that the most accurate results can only be obtained by using the cell area and surface fractions specific to the climate model of interest.
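The best-guess cell area formula above can be written out directly. The sketch below is a standalone illustration rather than the Iris routine itself; the function name and the Earth radius value are assumptions, and the bounds are assumed to be given in degrees and converted to radians.

```python
import numpy as np

# Assumed spherical Earth radius in metres (illustrative value only)
EARTH_RADIUS_M = 6371000.0


def cell_area_from_bounds(lon0, lon1, lat0, lat1, radius=EARTH_RADIUS_M):
    """Area of a latitude-longitude cell on a sphere, bounds in degrees."""
    lon0, lon1, lat0, lat1 = np.deg2rad([lon0, lon1, lat0, lat1])
    return radius**2 * (lon1 - lon0) * (np.sin(lat1) - np.sin(lat0))


# A 1 x 1 degree cell at the equator and one near the pole
equatorial_cell = cell_area_from_bounds(0.0, 1.0, 0.0, 1.0)
polar_cell = cell_area_from_bounds(0.0, 1.0, 89.0, 90.0)
```

A quick sanity check on the formula is that a single cell spanning all longitudes and latitudes recovers the surface area of the sphere, $4 \pi r^2$.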

We also ensure that we use the surface fractions appropriate to the domain of the data of interest ($f$ in Equation (1)). For atmospheric data, we simply use surface land fractions for land regions and surface ocean fractions for ocean regions. For land- and ocean-specific data, more care is required because there are two different options. Assume we are considering land-specific data. The first option is to include the land surface fractions in the weights. This choice means that each cell is weighted by the area of land in the cell, not by the total area of the cell. The second is to only consider whether any of the cell is land. If any of the cell is land, then we set $f$ in Equation (1) equal to one; otherwise, we set $f$ equal to zero. This choice means that each cell is weighted by the total area of the cell, unless it is not land at all, in which case it receives zero weight. The same logic can be applied to ocean-specific data, using ocean fractions instead of land fractions as required.

For land data, for example data from C4MIP (Jones et al., 2016), we use the first option; that is, we weight by area and surface land fractions in our calculations. These surface fraction weights are important to apply because ‘you do need to weight the output by land frac (sftlf is the CMIP variable name)’ (Jones, 2020b, personal communication) in order to calculate weighted means correctly. In contrast, for ocean data we use the second option; that is, we weight only by area, having first assigned a weight of zero to any cells that do not contain any ocean. We must apply this logic because ocean-specific data are given with respect to the horizontal area of the entire cell, not the horizontal area of ocean in the cell (Griffies et al., 2016).
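The two choices of $f$ described above can be contrasted with a small sketch. The function names are hypothetical, land fractions are assumed to lie between 0 and 1 (CMIP's sftlf is commonly reported in per cent and would need rescaling), and the ocean fraction is assumed to be one minus the land fraction.

```python
import numpy as np


def land_data_fraction(land_fraction):
    """First option: weight each cell by the fraction of it that is land."""
    return np.asarray(land_fraction, dtype=float)


def ocean_data_fraction(land_fraction):
    """Second option: weight one for any cell containing ocean, else zero."""
    # Assumption: everything in a cell that is not land is ocean
    ocean_fraction = 1.0 - np.asarray(land_fraction, dtype=float)
    return (ocean_fraction > 0).astype(float)


# Cells that are all land, partly land and all ocean
land_fraction = np.array([1.0, 0.3, 0.0])
f_land = land_data_fraction(land_fraction)
f_ocean = ocean_data_fraction(land_fraction)
```

The partly-land cell receives a weight of 0.3 for land-specific data but a full weight of one for ocean-specific data, reflecting the different conventions for how the two kinds of data are reported.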

This step is the most computationally expensive step. Being built on Iris (Met Office, 2019), netCDF-SCM is able to handle large datasets, largely thanks to the dask package (Dask Development Team, 2016). With parallel processing and user-defined memory usage settings, netCDF-SCM is able to be run on personal computers as well as cloud high performance computing infrastructures.

2.2.2 Stitching experiments

CMIP data typically come in a ‘family’ of experiments, with each experiment having a ‘parent’ experiment and potentially ‘children’, ‘grandchildren’, ‘great-grandchildren’, etc. Each ‘generation’ can be joined with the previous to make a longer, continuous timeseries.

An obvious example of this is the ScenarioMIP (O'Neill et al., 2016) experiments. ScenarioMIP includes simulations of future scenarios. In order to create a complete timeseries from pre-industrial times through to the future, the scenario simulations must be joined with the historical simulations (their parent experiment) and the pre-industrial control (piControl) simulations. The need for joining experiments in this way, or ‘stitching’ them, appears beyond scenario simulations too. For example, many ZECMIP (Jones et al., 2019) experiments are the children of the one per cent per year increase in atmospheric CO2 concentration experiments (1pctCO2).

Stitching experiments together back to pre-industrial control runs requires a number of steps (Algorithm 1). It is necessary to combine the metadata provided in each file with the data reference syntax to efficiently traverse the data archive to find each relevant output. Then, the branch times in each experiment must be checked and aligned to ensure that the continuous timeseries is as intended. For full provenance, the metadata from each generation that has contributed to the ‘stitched’ output should also be preserved. Performing these steps is another of netCDF-SCM's key functions, leading to continuous stitched outputs with complete metadata of all datasets that have been used to make that output.

Algorithm 1. Algorithm for stitching and normalizing. This algorithm does two things. Firstly, it joins together experiments which form a continuous sequence (e.g. a scenario-based experiment, which continues from a historical experiment). Secondly, it can, if desired, normalize experiments against pre-industrial control values. This allows users to, for example, calculate deviations from the background state or remove model drift from their output timeseries.
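The joining step can be pictured with a minimal sketch. It assumes annual timeseries held as NumPy arrays and a hypothetical `stitch` helper, and it takes the branch point to be the child's first time step. The metadata traversal and branch-time checks described above are deliberately left out, so this is far simpler than what netCDF-SCM does.

```python
import numpy as np


def stitch(parent_years, parent_values, child_years, child_values):
    """Join a child experiment onto its parent at the child's first year."""
    branch_year = child_years[0]
    # Keep only the part of the parent that precedes the child
    keep = parent_years < branch_year
    years = np.concatenate([parent_years[keep], child_years])
    values = np.concatenate([parent_values[keep], child_values])
    if np.any(np.diff(years) != 1):
        raise ValueError("stitched time axis is not continuous")
    return years, values


# Toy data: a 'historical' run of zeros and a 'scenario' run of ones
hist_years = np.arange(1850, 2015)
hist_values = np.zeros(hist_years.size)
ssp_years = np.arange(2015, 2101)
ssp_values = np.ones(ssp_years.size)

years, values = stitch(hist_years, hist_values, ssp_years, ssp_values)
```

Applying the same helper repeatedly (child onto parent, then the result onto the grandparent) gives one way of walking back through a family of experiments.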

These stitched timeseries are a key part of our dataset. They contain many key climate signals, such as interannual (for monthly data) and multi-decadal variability.

However, there can be an extra step: de-drifting, or applying ‘normalization’. In this context, normalization refers to calculating anomalies from some reference values. In some cases, this is trivial, for example when calculating anomalies relative to a given period within the existing output, such as the 1850–1900 period.

Nonetheless, there are many cases which require more complex analyses, for example calculating anomalies relative to a 21-year (or 30-year) running mean of the equivalent period in the pre-industrial control run. Such a calculation requires finding the pre-industrial control data, identifying the equivalent period in the pre-industrial control run and correctly lining it up with the data to be normalized, before finally calculating the anomalies. This can be done for single ensemble members (Figure 2) and for multiple members (in which case particular care must be taken to normalize each ensemble member against the correct period from the pre-industrial control run, Figure S1). For small subsets of the data, this can be done manually; however, automated solutions are necessary to perform it over an entire CMIP archive.

Figure 2. Joining of timeseries (‘stitching’) and calculation of anomalies from the pre-industrial control runs (‘normalization’). Following CMIP6 terminology, we show here a ‘family’ of experiments, with G6solar being the child, ssp585 being the parent, historical being the grandparent and piControl being the great-grandparent. (a) raw data on its native time axes; (b) raw data alongside the pre-industrial control run, with the pre-industrial control run's time axis shifted so the branch point (i.e. the point in time at which an experiment branches from its parent, in this case the point in time at which the historical experiment branches from the piControl experiment) occurs at the same time in the pre-industrial control run and the historical experiment; (c) final stitched and normalized output. In this case, the stitched and normalized output comprises the historical experiment from 1850–2014, the ssp585 experiment from 2015–2019 and the G6solar experiment from 2020–2100 and has been normalized against a 21-year running mean of the pre-industrial control experiment (red line in panel b).

Within our dataset, we currently offer four different normalization options (in addition to the outputs which have had no normalization applied). Before describing these options, we stress that all of our processing data are openly available so users who wish to use different normalization options to the ones provided are able to do so. Our four different normalization options are anomalies and de-drifting with 21- and 30-year running means as reference values. The anomalies are calculated as the difference between the model output and the corresponding running mean from the pre-industrial control experiment. In contrast, the de-drifting calculations simply remove any drift in the running mean of the pre-industrial control experiment, without calculating differences. The anomaly calculations are useful for variables such as tas (surface air temperature), where changes in the variable from the pre-industrial state are of most interest. The de-drifting calculations are used for variables such as cLand (total carbon in all terrestrial pools), where the absolute values are of importance, but it is also important to remove model drift before performing further analysis.

For all normalization options, we use an equation of the form:
$$v_{\text{norm}}(r, t) = v_{\text{raw}}(r, t) - v_{\text{pi}}(r, t) \tag{2}$$
where $v_{\text{norm}}(r, t)$ is the normalized values for region $r$ at time $t$, $v_{\text{raw}}(r, t)$ is the raw values and $v_{\text{pi}}(r, t)$ is the running-mean reference values from the pre-industrial control run (either absolute values or drift values depending on the normalization method).
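A minimal sketch of Equation (2) follows, assuming annual NumPy arrays and a known index in the pre-industrial control run that corresponds to the first time step of the data being normalized. The function names are hypothetical. The `dedrift` branch encodes one possible reading of ‘drift values’, namely the change in the running mean relative to its value at the branch point; that reading is an assumption and not a statement of how netCDF-SCM defines it.

```python
import numpy as np


def running_mean(values, window):
    """Centred running mean; positions where the window does not fit are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    n_valid = values.size - window + 1
    if n_valid < 1:
        return out
    start = (window - 1) // 2
    kernel = np.ones(window) / window
    out[start : start + n_valid] = np.convolve(values, kernel, mode="valid")
    return out


def normalise(raw, picontrol, branch_index, window=21, dedrift=False):
    """Subtract the piControl running mean, aligned at the branch point."""
    reference = running_mean(picontrol, window)
    aligned = reference[branch_index : branch_index + raw.size]
    if aligned.size != raw.size or np.isnan(aligned).any():
        raise ValueError("piControl does not cover the period to normalise")
    if dedrift:
        # Assumed meaning of 'drift': change relative to the branch point
        aligned = aligned - aligned[0]
    return raw - aligned


# Toy control run with a linear drift, and a run offset from it by 1.0
picontrol = 14.0 + 0.01 * np.arange(300)
raw = picontrol[50:150] + 1.0

anomalies = normalise(raw, picontrol, branch_index=50)
dedrifted = normalise(raw, picontrol, branch_index=50, dedrift=True)
```

In this toy case the anomalies are a constant offset from the control run, while the de-drifted series keeps its absolute level but loses the linear trend inherited from the control run.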

2.2.3 Retracted data

CMIP data are occasionally found to be erroneous and hence retracted (Balaji et al., 2018). To handle this, as part of netCDF-SCM, we provide a simple tool which checks if a data file is based on any retracted data (see https://netcdf-scm.readthedocs.io/en/latest/usage/using-cmip-data.html, last accessed 25 June 2020). In addition, the tool will also examine the data's licence and, to the extent possible, warn the user about any non-standard licence terms.

These tools take advantage of CMIP6's ‘dataset-centric rather than system centric’ approach (Balaji et al., 2018). The dataset-centric approach allows data users to check the validity of their data at point of use, rather than relying on the data provider to have done this for them. The dataset-centric approach also ensures that ‘dark repositories’ (Balaji et al., 2018), such as our derived dataset, maintain the connection between data user and the original source of the data. In our derived dataset, we maintain this connection by providing the persistent identifiers of all datasets within our metadata (specifically the tracking_id attributes).

3 DATASET ACCESS

The dataset described in this paper is openly available at https://doi.org/10.5281/zenodo.3951890, and all our data processing code is openly available at https://gitlab.com/netcdf-scm/calibration-data. Any use of the data must follow CMIP's terms of use (see discussion in Section 2.1). At present, our dataset contains timeseries for over 100 models and 40 experiments of interest from the CMIP5 and CMIP6 archives. In total, we have over 40,000 timeseries. To date, we have processed 83 variables, full descriptions of which are available in the ‘Model output specifications’ section of https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html (last accessed 8th October 2020). If users would like extra variables, we are happy to discuss adding more into our dataset. Our current variable list is: energy flux and temperature variables – hfds, rlut, rsdt, rsut, tas, tasmin, tasmax, tos, ts; precipitation – pr; carbon cycle related variables – cLand, cLitter, cLitterGrass, cLitterShrub, cLitterSubSurf, cLitterSurf, cLitterTree, cMisc, cOther, cProduct, cSoil, cSoilFast, cSoilMedium, cSoilSlow, cStem, cVeg, cVegGrass, cVegShrub, cVegTree, cWood, co2mass, co2s, fAnthDisturb, fBNF, fDeforestToAtmos, fDeforestToProduct, fFire, fFireAll, fFireNat, fGrazing, fHarvest, fHarvestToAtmos, fHarvestToProduct, fLitterSoil, fLuc, fNAnthDisturb, fNLitterSoil, fNProduct, fNVegLitter, fNdep, fNfert, fNgas, fNgasFire, fNgasNonFire, fNloss, fNnetmin, fNup, fVegLitter, fco2antt, fco2fos, fco2nat, fgco2, gpp, nbp, nep, netAtmosLandCO2Flux, npp, nppGrass, nppOther, nppShrub, nppTree, ra, rh, rhGrass, rhLitter, rhShrub, rhSoil, rhTree; nitrogen cycle related variables – nLand, nLeaf, nLitter, nMineral, nProduct, nRoot, nSoil, nStem. 
We include output for 11 large-scale regions (World, Northern Hemisphere, Southern Hemisphere, ocean, land, Northern Hemisphere ocean, Northern Hemisphere land, Southern Hemisphere ocean, Southern Hemisphere land, North Atlantic Ocean and the El Niño 3.4 region, defined as the region within 5°N–5°S and 170°W–120°W) and, by using the regionmask package (Hauser et al., 2020), all of the IPCC climate reference regions defined by Iturbide et al. (2020).

The timeseries are continuous, monthly timeseries for specific variables, experiments and regions of interest for a selection of the CMIP5 and CMIP6 archives. These timeseries are the combination of the experiment of interest, any (potentially multiple) experiments from which it ‘branched’ and any normalization that is applied (Section 2.2 and Figure 2).

Our dataset is a different product compared to the data available in the IPCC AR6 Atlas (https://github.com/SantanderMetGroup/ATLAS, last accessed 9th October 2020). At present, the only similarity is that both our dataset and the Atlas provide surface air temperature (tas) at various regional aggregations for tier 1 experiments from CMIP5 and CMIP6. There are four main differences. Firstly, while the Atlas uses a binary land/ocean mask (i.e. each value is a one or a zero) and cosine of latitude as a proxy for area weights, we use the model-reported cell areas (where available) and a continuous land fraction (including subtleties for land-only and ocean-only data, see Section 2.2.1) to calculate weighted, aggregate metrics. Secondly, we provide stitched outputs, joining each experiment with its parent, grandparent, etc. experiments. Thirdly, the Atlas provides precipitation (pr) timeseries, whereas we do not provide precipitation. Instead, we provide data for 82 other variables as described previously. Finally, the Atlas provides one ensemble member per climate model, while we provide as many ensemble members as are available.

We provide our data in a comma separated value (csv) format. This format is composed of three key parts and uses the extension .MAG because it is directly compatible with the MAGICC7 reduced complexity climate model (Meinshausen et al., 2019). The first part is the header, which contains the date the file was written, the contact for the file, the version of netCDF-SCM used to crunch the data and the version of Pymagicc (Gieseke et al., 2018) used to write the file. The second is the metadata. This contains all metadata from each of the raw datafiles, plus extra information and metadata about the method used to derive the final timeseries included in the file. It also contains a FORTRAN90 namelist with basic information about the data in the file. A particularly useful piece of information is the THISFILE_FIRSTDATAROW line, which allows automated readers to skip to the line of interest if they are only interested in the data. The third and final section is the data. The data block is composed of a four-line header with variable, units and region information for each timeseries as well as a MAGICC7-specific row, TODO, which can generally be ignored. After the header comes the data itself. The data block has column-oriented data, with the first column being the time axis (in years) and each subsequent column being a different timeseries (sometimes referred to as ‘wide’ data, although this term is imprecise (Wickham, 2014)).
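To make the role of THISFILE_FIRSTDATAROW concrete, the sketch below parses a small, made-up text block laid out along the lines described above. The example content, the whitespace-separated columns, the label on the fourth header row and the interpretation of THISFILE_FIRSTDATAROW as a 1-indexed line number are all assumptions for illustration; real files should be checked against the format documentation, and a dedicated reader such as the one in Pymagicc is likely to be more robust.

```python
import io
import re

import numpy as np

# Made-up file content, for illustration only
EXAMPLE = """\
---- HEADER ----

Date: 2020-06-25

---- METADATA ----

source: illustrative only

&THISFILE_SPECIFICATIONS
    THISFILE_FIRSTDATAROW = 17
/

    VARIABLE     tas     tas
        TODO     SET     SET
       UNITS       K       K
       YEARS   World    Land
        1850    0.00    0.10
        1851    0.05    0.12
"""


def read_data_block(text):
    """Return the four header rows, the time axis and the data columns."""
    lines = text.splitlines()
    match = re.search(r"THISFILE_FIRSTDATAROW\s*=\s*(\d+)", text)
    if match is None:
        raise ValueError("no THISFILE_FIRSTDATAROW entry found")
    # Assumption: the value is the 1-indexed line number of the first data row
    first = int(match.group(1)) - 1
    # The four lines before the data are the column header rows
    header = {row.split()[0]: row.split()[1:] for row in lines[first - 4 : first]}
    data = np.loadtxt(io.StringIO("\n".join(lines[first:])), ndmin=2)
    return header, data[:, 0], data[:, 1:]


header, years, values = read_data_block(EXAMPLE)
```

Each column of `values` then lines up with the corresponding entries in the header rows, so a timeseries can be selected by its variable, units and region labels.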

The data archive grows as we add new CMIP6 results. An up-to-date full collection (alongside instructions for automated downloads) can be found at https://cmip6.science.unimelb.edu.au. Examples of how to use the data can be found in https://gitlab.com/netcdf-scm/calibration-data/-/tree/master/notebooks (last accessed 25 June 2020), and we encourage any users of the data to add further examples, especially in computing languages other than Python.

4 POTENTIAL DATASET USE AND REUSE

The key users of this dataset are reduced complexity climate modellers. These regional-aggregate timeseries are a key part of model calibration (see, e.g., Meinshausen et al. (2011)) and comprehensive datasets allow reduced complexity models to be validated over a wide range of experiments and output variables.

Having said this, we believe that the dataset can be useful well beyond the reduced complexity climate model community. As discussed in Section 1, processing CMIP data is an intimidating task for expert users and not possible for those without specialist training. We hope that our aggregate dataset removes this need for specialist training, thanks to its significantly reduced data volume and text-based format. As a result, the dataset presented here may be useful to climate change researchers outside the climate modelling community, policymakers, businesses and even journalists.

The dataset presented here comes with three important caveats. The first is that we make no guarantees about how up-to-date our data is. As discussed previously, the onus is on users of the data to check for retractions before using the data (see Section 2.2.3 for discussion of our automated tool for checking such retractions). The second is that the area-weighting used (Equation (1)) is only one of many possible area-weighting choices. For example, other users may wish to partition data into area/land boxes based on whether the fraction of each gridbox is above some threshold or not. At present, we do not provide data for area-weighting choices beyond the one described in this paper. For users who need to do such analysis, we are able to provide guidance on how this could be done with netCDF-SCM via netCDF-SCM's issue tracker (https://gitlab.com/netcdf-scm/netcdf-scm/-/issues). Thirdly, we provide only a limited number of ways of calculating anomalies. Again, for users who wish to calculate anomalies in a different way to what we have provided, netCDF-SCM's issue tracker can be used for discussions and guidance.

On the software side, the netCDF-SCM tool is in its relative infancy and is currently only developed by a limited community. As a result, many improvements could be made. We hope that netCDF-SCM's open source nature, with its extensive tests, invites contributions from throughout the climate community and beyond. Such contributions will improve netCDF-SCM's functionality and reduce the need for duplicate effort.

As a first suggestion, we note that much of netCDF-SCM's functionality is a duplication of functionality within the ESMValTool, ‘A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP’ (Eyring et al., 2016b). The duplication results from the parallel development of these two projects, which were both too immature to be combined when they were begun. Adding netCDF-SCM's functionality into the ESMValTool, which is a much bigger and better supported project, would reduce this duplication and likely provide benefits for both groups.

Outside of integration with the ESMValTool, improvements could be made to netCDF-SCM's memory usage, dask usage and parallelization. These may lead to significant improvements in processing performance, as such optimizations, particularly the use of dask's task planning capabilities, have only been performed to a limited degree to date.

5 CONCLUSIONS

We have presented a dataset of monthly, global, hemispheric and land/ocean means based on the CMIP5 and CMIP6 archives. The dataset is aimed at reduced complexity climate modellers, but may also be useful for many other researchers. Our dataset joins the different levels of experiments, reducing the need for users to manage and join multiple separate datasets before they can be used. This dataset is orders of magnitude smaller than the raw datasets themselves and hence can be managed much more easily by non-expert users. In addition, we provide the dataset in a text-based format, which removes the need for users to be familiar with the netCDF format before they can use the data. We hope this facilitates use by groups outside the science community, for example policymakers, actuaries and journalists.

We add new CMIP6 results to our dataset regularly and simultaneously remove any data derived from retracted datasets. We also provide simple tools to check whether a user's data have been retracted since they downloaded the data and to highlight any non-standard licence terms so that users can make sure they have the most up-to-date information possible.

We hope this can be a great community resource, which builds on the community efforts of the netCDF format (Unidata, 2020), Iris (Met Office, 2019) and generations of CMIP (Taylor et al., 2012; Eyring et al., 2016a).

ACKNOWLEDGEMENTS

We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modelling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF. We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modelling groups (listed in Table 1 of this paper) for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We thank the authors of Iris for all of their efforts in handling netCDF files. Without their efforts, netCDF-SCM would not have been possible. The work was undertaken in collaboration with the Melbourne Data Analytics Platform (MDAP) at the University of Melbourne. The work was also supported by Science IT at the University of Melbourne. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. ZN benefited from support provided by the ARC Centre of Excellence for Climate Extremes (CE170100023).

    CONFLICTS OF INTEREST

    The authors declare that they have no conflict of interest.

    OPEN PRACTICES

    This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5281/zenodo.4536523. Learn more about the Open Practices badges from the Center for Open Science: https://osf.io/tvyxz/wiki.