Regionally aggregated, stitched and de‐drifted CMIP‐climate data, processed with netCDF‐SCM v2.0.0

The world's most complex climate models are currently running a range of experiments as part of the Sixth Coupled Model Intercomparison Project (CMIP6). Added to the output from the Fifth Coupled Model Intercomparison Project (CMIP5), the total data volume will be of the order of 20PB. Here, we present a dataset of annual, monthly, global, hemispheric and land/ocean means derived from a selection of experiments of key interest to climate data analysts and reduced complexity climate modellers. The derived dataset is a key part of validating, calibrating and developing reduced complexity climate models against the behaviour of more physically complete models. In addition to its use for reduced complexity climate modellers, we aim to make our data accessible to other research communities. We facilitate this in a number of ways. Firstly, given the focus on annual, monthly, global, hemispheric and land/ocean mean quantities, our dataset is orders of magnitude smaller than the source data and hence does not require specialized 'big data' expertise. Secondly, again because of its smaller size, we are able to offer our dataset in a text-based format, greatly reducing the computational expertise required to work with CMIP output. Thirdly, we enable data provenance and integrity control by tracking all source metadata and providing tools which check whether a dataset has been retracted, that is, identified as erroneous. The resulting dataset is updated as new CMIP6 results become available and we provide a stable access point to allow automated downloads. Along with our accompanying website (cmip6.science.unimelb.edu.au), we believe this dataset provides a unique community resource, as well as allowing non-specialists to access CMIP data in a new, user-friendly way.


| INTRODUCTION
Coupled atmosphere-ocean earth system models are our most comprehensive representations of the climate system. Earth system models from around the globe are currently running as part of the Sixth Coupled Model Intercomparison Project (CMIP6; Eyring et al., 2016a). CMIP6 builds on the Fifth Coupled Model Intercomparison Project (CMIP5; Taylor et al., 2012), the output of which is still widely used. Combined, these two projects represent our most comprehensive, physically based estimates of the coupled earth system's behaviour under a wide range of experiments.
However, CMIP5 and CMIP6 models are computationally expensive and hence cannot be run for all applications of interest. To fill this gap, a number of so-called reduced complexity climate models (also referred to as 'simple climate models') have been developed. An important part of developing such models is calibration: deriving a set of parameters which allows them to best replicate the behaviour of the more complex models which participate in the CMIP experiments. In order to do this calibration, one must first process the output from CMIP5 and CMIP6.
Once complete, the CMIP6 archive will be one of the world's largest data archives, with an expected total volume in the region of 18PB (Balaji et al., 2018). Fortunately, reduced complexity climate modellers typically only require annual-mean or monthly model output on hemispheric and land/ocean scales. This significantly reduces the volume of data they must handle. Nonetheless, reduced complexity climate models typically include a number of different modules, which cover the full range of the emissions-climate change cause-effect chain. Calibrating all of these different modules requires handling several datasets, such that the total volume of raw CMIP output of interest to a 'comprehensive' reduced complexity climate model will still be of the order of 50TB. Processing a data volume of this size is an intimidating task, even for expert users.
The data processing is further complicated by five extra factors. The first is that the raw data are all in the custom, climate-specific netCDF data format (Unidata, 2020). Without specialist training, it cannot be read, let alone analysed. The second factor is that the data are all sorted according to a highly regularized data reference syntax (Balaji et al., 2018). Such regularization is required to make the data machine-processable; however, it can be confusing for non-experts. The third factor is that data are typically presented in absolute values. However, reduced complexity climate models are typically perturbation models; that is, they calculate perturbations from some reference state rather than absolute values. Given the data volume and the sometimes complex relationships between CMIP experiments, calculating such perturbations for a large data volume is not a trivial task. The fourth factor is licensing. All CMIP5 and CMIP6 files are released under a specific licence to which users must adhere, and retrieving this information is not easily done. The final factor is retractions, that is, the removal of data which is later identified as erroneous. Such retractions are essential to avoid erroneous results propagating into the scientific literature. However, at the moment there are only a limited number of tools which allow a user to check whether they are using a retracted dataset.
The target audience for this dataset is reduced complexity climate models and modellers. Here, reduced complexity models refer to models which focus on global- and annual-mean properties of the climate system. As a result of their limited spatial and temporal resolution, reduced complexity models are very computationally efficient. These models are typically used when more complex models, such as those participating in CMIP, are too computationally expensive to use. For example, reduced complexity models are used within many integrated assessment models to assess the climate implications of different emissions pathways (given that integrated assessment models typically require hundreds to thousands of climate realizations, using CMIP models is not computationally feasible). Some prominent examples of reduced complexity models are MAGICC (Meinshausen et al., 2011), FaIR (Smith et al., 2018) and hector (Hartin et al., 2015). A detailed discussion of reduced complexity models and an overview of models available in the literature can be found in the first phase of the Reduced Complexity Model Intercomparison Project.
Our CMIP5- and CMIP6-derived dataset was extracted using the open source tool we developed, netCDF-SCM (netCDF handling for reduced complexity/simple climate modellers, see Section 2.2), and is ready for use by reduced complexity climate modellers. It is the result of addressing all the complications described above and includes global, hemispheric and land/ocean annual and monthly means of a variety of CMIP5 and CMIP6 output. Given the processing performed, this dataset is orders of magnitude smaller than the original data. The reduced data volume means that we can feasibly provide the data in a text-based format. Thus, while the dataset is targeted at developers of reduced complexity climate models, its simple, text-based format also allows non-expert users beyond the climate science community to read and analyse the data, as they no longer need to engage with the climate-specific netCDF format.

KEYWORDS
aggregate, climate, CMIP, model, projections

| Datasets
To produce this derived dataset, we rely on the CMIP5 and CMIP6 archives (available via esgf-node.llnl.gov/search/cmip5/ and esgf-node.llnl.gov/search/cmip6/, respectively, last accessed 25 June 2020), both of which rely on the Earth System Grid Federation (Williams et al., 2011; Cinquini et al., 2014; Williams et al., 2016). Accordingly, any use of our derived dataset must also follow the conditions of the original data source and users should cite the relevant work of each modelling group whose output they use. Full details for CMIP5 can be found at pcmdi.llnl.gov/mips/cmip5/terms-of-use.html (last accessed 25 June 2020), and for CMIP6 at pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html (last accessed 25 June 2020).
To date, we have processed data from over 400 different CMIP6 model-experiment pairs and over 125 CMIP5 model-experiment pairs (Tables 1 and 2). We have built a custom software package, netCDF-SCM (further details in Section 2.2), that can handle any dataset that can also be handled by Iris, 'a powerful, format-agnostic, community-driven Python library for analysing and visualizing Earth science data' (Met Office, 2019). In practice, this restricts netCDF-SCM to handling CF-compliant datasets (details of CF-compliance available at http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html, last accessed: 25 June 2020). Having said that, netCDF-SCM is focussed on CMIP data and much of its value is based on the data reference syntax associated with CMIP output. The CMIP6 data reference syntax is available at github.com/WCRP-CMIP/CMIP6_CVs (last accessed: 25 June 2020), whilst the CMIP5 data reference syntax is available at https://pcmdi.llnl.gov/mips/cmip5/docs/cmip5_data_reference_syntax.pdf (last accessed: 25 June 2020).
Here, we comply with the CMIP5 and CMIP6 conditions of use by providing a summary of the data sources we have processed to date in Tables 1 and 2 and including the required acknowledgement statements. Tables 1 and 2 have been produced automatically from the source data using netCDF-SCM (documentation at netcdf-scm.readthedocs.io/en/latest/usage/using-cmip-data.html, last accessed 25 June 2020). In addition, we provide tools to check whether a dataset has been retracted (see Section 2.2.3). We hope this automation can be useful to other researchers too and enhances other existing CMIP6 functionality, given our extensive efforts to follow 'CMIP6 data management and integrity objectives' (Balaji et al., 2018).

| Methods
In principle, our derived dataset is easy to produce. It consists of weighted means of different horizontal area sections of the raw data, which are then joined together to create timeseries which span both the historical and future scenario simulations. In practice, collecting, joining and processing such huge data volumes makes this a non-trivial effort.
To address this in an automated way, we have developed the netCDF-SCM package (gitlab.com/netcdf-scm/netcdf-scm), available under the open source initiative approved BSD-3-Clause licence (see https://opensource.org/licenses and Morin et al. (2012) for more information on open source software licences for scientists). The netCDF-SCM package builds on the Iris package (Met Office, 2019) and relies heavily on the numerous capabilities Iris offers. We welcome the addition of new features to the netCDF-SCM package (its development is hosted at gitlab.com/netcdf-scm/netcdf-scm).
We have validated netCDF-SCM in a number of ways. Firstly, we have compared our global-mean calculations with the limited set of data available on the KNMI Climate Explorer (climexp.knmi.nl/start.cgi, last accessed 7th October 2020) and found them to be within 0.1% in all cases. Secondly, we have run an extensive test suite over the entire netCDF-SCM package (the test suite is run with every update to the software in a process known as 'continuous integration', see, e.g., https://about.gitlab.com/blog/2018/01/22/a-beginners-guide-to-continuous-integration/). This test suite includes unit tests (tests of individual bits of functionality in isolation from the rest of the package), integration tests (end-to-end tests of the behaviour of the package) and regression tests (tests of changes in outputs of the package compared to previous versions). Combined, these tests cover 98% of the codebase, with much of the package being run many times over as part of the testing process.

| Weighted means
We combine raw data, cell areas and cell surface fraction information to produce appropriately weighted means for each region of interest (Figure 1). We calculate the weighted means as follows:

$$v(r, t, \vec{Z}) = \frac{\sum_{\vec{Y}} X(t, \vec{Y}, \vec{Z}) \, A(\vec{Y}) \, f(r, \vec{Y}, \vec{Z})}{\sum_{\vec{Y}} A(\vec{Y}) \, f(r, \vec{Y}, \vec{Z})} \quad (1)$$

where $v(r, t, \vec{Z})$ is the output for horizontal region of interest $r$ at time $t$ and non-horizontal spatial coordinate vector $\vec{Z}$, $\vec{Y}$ is the horizontal spatial coordinate of each cell, $X(t, \vec{Y}, \vec{Z})$ is the raw data at time $t$ in the cell with horizontal spatial coordinate vector $\vec{Y}$ and non-horizontal spatial coordinate vector $\vec{Z}$, $A(\vec{Y})$ is the horizontal area of the cell, and $f(r, \vec{Y}, \vec{Z})$ is the relevant surface fraction of the cell for the region of interest. Here, 'horizontal region' refers to a region defined by latitude and longitude only.
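The weighted mean above can be sketched directly in NumPy. This is an illustrative stand-alone computation, not netCDF-SCM's implementation; the toy 2 × 2 grid and the variable names are invented for the example:

```python
import numpy as np

def area_fraction_weighted_mean(X, A, f):
    """Weighted mean of raw data X over one horizontal region.

    X : raw data in each horizontal cell (e.g. temperature)
    A : horizontal area of each cell
    f : surface fraction of each cell belonging to the region

    Computes sum(X * A * f) / sum(A * f), i.e. the form of Equation (1)
    for a single region, time and vertical level.
    """
    weights = A * f
    return (X * weights).sum() / weights.sum()

# Toy 2x2 grid: two land cells, one coastal cell, one ocean cell
X = np.array([[280.0, 281.0], [282.0, 283.0]])  # raw data
A = np.array([[1.0, 1.0], [1.0, 1.0]])          # equal cell areas
f_land = np.array([[1.0, 1.0], [0.5, 0.0]])     # land fraction per cell

v_land = area_fraction_weighted_mean(X, A, f_land)
```

Cells with no land receive zero weight, and the coastal cell contributes only in proportion to its land fraction.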
The cell area and surface fractions specific to the climate model of interest are determined using the CMIP data reference syntax and the CMIP output metadata. If the cell area and/or surface fraction information is not available, netCDF-SCM will simply use inbuilt best-guess cell areas and surface fractions. The best-guess cell areas use Iris' (Met Office, 2019) functionality to calculate cell areas based on each cell's bounds (cell areas are calculated as r^2 (lon_1 − lon_0)(sin(lat_1) − sin(lat_0)), where r is the Earth's radius, lon_1 and lon_0 are the longitude (east-west) bounds of the cell, and lat_1 and lat_0 are the latitude (north-south) bounds of the cell). The best-guess surface fractions are the result of interpolating the surface fractions from IPSL-CM6A-LR climate model historical simulations (Boucher et al., 2018g) onto a regular 1° × 1° grid. When calculating weighted averages, these best-guess surface fractions are then interpolated onto the grid of the model of interest. This process works best with data on a regular grid. Given their complexity, on some models' native grids it may be necessary to obtain the cell area and surface fraction information before the data can be processed. We have examined the importance of the choice of best-guess surface fractions and found that basing our best-guess surface fractions on other climate models would make a negligible difference (<1%) to our results. In cases where the cell area and surface fraction are particularly important (e.g. for small regions), we stress that the most accurate results can only be obtained by using the cell area and surface fractions specific to the climate model of interest.
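The cell-area formula quoted above is easy to check in isolation. The sketch below is not Iris's implementation, and the Earth radius value is an assumption made for the example; summing 1° × 1° cells over the globe should recover the surface area of the sphere:

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius; the value used by Iris may differ

def cell_area(lon0, lon1, lat0, lat1, radius=EARTH_RADIUS_M):
    """Area of a latitude/longitude cell from its bounds (in degrees):
    r^2 * (lon1 - lon0) * (sin(lat1) - sin(lat0)), with angles in radians."""
    lon0, lon1, lat0, lat1 = np.radians([lon0, lon1, lat0, lat1])
    return radius ** 2 * (lon1 - lon0) * (np.sin(lat1) - np.sin(lat0))

# Sanity check: summing 1 degree cells over the whole globe should give
# the area of the sphere, 4 * pi * r^2 (the sine terms telescope)
total = sum(
    cell_area(lon, lon + 1.0, lat, lat + 1.0)
    for lat in range(-90, 90)
    for lon in range(-180, 180)
)
sphere = 4 * np.pi * EARTH_RADIUS_M ** 2
```

The telescoping of the sine terms over latitude is why this formula is exact for any cell bounded by latitude and longitude lines, regardless of resolution.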
We also ensure that we use the surface fractions appropriate to the domain of the data of interest (f in Equation (1)). For atmospheric data, we simply use surface land fractions for land regions and surface ocean fractions for ocean regions. For land- and ocean-specific data, more care is required because there are two different options. Assume we are considering land-specific data. The first option is to include the land surface fractions in the weights. This choice means that each cell is weighted by the area of land in the cell, not by the total area of the cell. The second is to only consider whether any of the cell is land: if any of the cell is land, then we set f in Equation (1) equal to one; otherwise, we set f equal to zero. This choice means that each cell is weighted by the total area of the cell, unless it contains no land at all, in which case it receives zero weight. The same logic can be applied to ocean-specific data, using ocean fractions instead of land fractions as required.
For land data, for example data from C4MIP (Jones et al., 2016), we use the first option; that is, we weight by area and surface land fractions in our calculations. These surface fraction weights are important to apply because 'you do need to weight the output by land frac (sftlf is the CMIP variable name)' (Jones, 2020b, personal communication) in order to calculate weighted means correctly. In contrast, for ocean data we use the second option; that is, we weight only by area, having first assigned a weight of zero to any cells which do not contain any ocean. We must apply this logic because ocean-specific data are given with respect to the horizontal area of the entire cell, not the horizontal area of ocean in the cell (Griffies et al., 2016). This weighted-mean calculation is the most computationally expensive step. Being built on Iris (Met Office, 2019), netCDF-SCM is able to handle large datasets, largely thanks to the dask package (Dask Development Team, 2016). With parallel processing and user-defined memory usage settings, netCDF-SCM can be run on personal computers as well as on cloud and high-performance computing infrastructures.
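The two weighting options can be written out explicitly. `sftlf` and `sftof` are the CMIP variable names for land and sea surface fraction (reported in per cent); the toy arrays below are invented for illustration, and this is a sketch rather than netCDF-SCM's code:

```python
import numpy as np

def land_weights(cell_areas, sftlf):
    """Option 1 (used for land data, e.g. C4MIP output): weight each cell
    by its area of land, i.e. cell area times land fraction (sftlf, %)."""
    return cell_areas * sftlf / 100.0

def ocean_weights(cell_areas, sftof):
    """Option 2 (used for ocean data): weight each cell by its full area,
    but give zero weight to cells containing no ocean (sftof, %),
    because ocean output is reported per whole-cell area."""
    return cell_areas * (sftof > 0).astype(float)

# Toy values: an all-land cell, a coastal cell and an all-ocean cell
cell_areas = np.array([1.0, 1.0, 1.0])
sftlf = np.array([100.0, 40.0, 0.0])
sftof = 100.0 - sftlf
```

The coastal cell illustrates the difference: it contributes 40% of its area under option 1 but its full area under option 2.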

| Stitching experiments
CMIP data typically come in a 'family' of experiments, with each experiment having a 'parent' experiment and potentially 'children', 'grandchildren', 'great-grandchildren', etc. Each 'generation' can be joined with the previous to make a longer, continuous timeseries.
An obvious example of this is the ScenarioMIP (O'Neill et al., 2016) experiments. ScenarioMIP includes simulations of future scenarios. In order to create a complete timeseries from pre-industrial times through to the future, the scenario simulations must be joined with the historical simulations (their parent experiment) and the pre-industrial control (piControl) simulations. The need for joining experiments in this way, or 'stitching' them, appears beyond scenario simulations too. For example, many ZECMIP (Jones et al., 2019) experiments are the children of the one per cent per year increase in atmospheric CO2 concentration experiment (1pctCO2).
Stitching experiments together back to pre-industrial control runs requires a number of steps (Algorithm 1). It is necessary to combine the metadata provided in each file with the data reference syntax to efficiently traverse the data archive to find each relevant output. Then, the branch times in each experiment must be checked and aligned to ensure that the continuous timeseries is as intended. For full provenance, the metadata from each generation that has contributed to the 'stitched' output should also be preserved. Performing these steps manually across an entire CMIP archive is infeasible, so netCDF-SCM automates them. These stitched timeseries are a key part of our dataset. They contain many key climate signals, such as interannual (for monthly data) and multi-decadal variability.
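The core of the traversal can be sketched as follows (Algorithm 1 itself is not reproduced here). The dictionary stands in for reading each file's global attributes (CMIP6 files record `parent_experiment_id`); a real implementation must also locate the files via the data reference syntax and align the recorded branch times:

```python
def experiment_chain(metadata_lookup, experiment):
    """Follow 'parent_experiment_id' links back to the control run.

    metadata_lookup maps an experiment id to a dict holding (a subset of)
    the netCDF global attributes. Real code would read these attributes
    from the files themselves and check branch times at every step.
    """
    chain = [experiment]
    while True:
        parent = metadata_lookup[chain[-1]].get("parent_experiment_id")
        if parent in (None, "no parent"):
            return chain
        chain.append(parent)

# Toy metadata mirroring the ScenarioMIP family described above
metadata = {
    "ssp585": {"parent_experiment_id": "historical"},
    "historical": {"parent_experiment_id": "piControl"},
    "piControl": {"parent_experiment_id": "no parent"},
}
```

Walking the chain for `ssp585` yields the full family back to the pre-industrial control, which is exactly the set of experiments that must be stitched together.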
However, there can be an extra step: de-drifting, or applying 'normalization'. In this context, normalization refers to calculating anomalies from some reference values. In some cases, this is trivial, for example taking anomalies relative to a given period within the existing output, such as the 1850-1900 period.
Nonetheless, there are many cases which require more complex analyses. For example, calculating anomalies relative to a 21-year (or 30-year) running mean of the equivalent period in the pre-industrial control run. Such a calculation requires finding the pre-industrial control data, identifying the equivalent period in the pre-industrial control run, correctly lining it up with the data to be normalized before finally calculating the anomalies. This can be done for single ensemble members ( Figure 2) and for multiple members (in which case particular care must be taken to normalize each ensemble member against the correct period from the pre-industrial control run, Figure S1). For small subsets of the data, this can be done manually; however, automated solutions are necessary to perform it over an entire CMIP archive.
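A minimal sketch of this running-mean normalization is below. The edge handling (holding the first and last window means constant) is a simplifying assumption made for the example, not necessarily netCDF-SCM's choice:

```python
import numpy as np

def running_mean(x, window=21):
    """Centred running mean. The first and last window means are held
    constant at the edges so the output matches the input length
    (a simplifying assumption for this sketch)."""
    out = np.convolve(x, np.ones(window) / window, mode="valid")
    half = window // 2
    return np.concatenate([np.full(half, out[0]), out, np.full(half, out[-1])])

def normalise_against_picontrol(v_raw, v_picontrol, branch_index, window=21):
    """Anomaly of v_raw relative to the running mean of the equivalent
    period of the pre-industrial control run; branch_index is where the
    experiment's family branches from the control run."""
    reference = running_mean(v_picontrol, window)
    return v_raw - reference[branch_index : branch_index + v_raw.size]

# Toy example: a pre-industrial control run with a small linear drift and
# an experiment sitting exactly 1.0 above the drifting control state
v_pi = 0.01 * np.arange(240)
v_raw = v_pi[60:180] + 1.0
anomaly = normalise_against_picontrol(v_raw, v_pi, branch_index=60)
```

Because the running mean of a linear drift is the drift itself, the anomaly here is a constant 1.0: the drift has been removed while the genuine offset is retained.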
Within our dataset, we currently offer four different normalization options (in addition to the outputs which have had no normalization applied). Before describing these options, we stress that all of our processing data are openly available, so users who wish to use normalization options different from the ones provided are able to do so. Our four normalization options are anomalies and de-drifting, each with 21- and 30-year running means as reference values. The anomalies are calculated as the difference between the model output and the corresponding running mean from the pre-industrial control experiment. In contrast, the de-drifting calculations simply remove any drift in the running mean of the pre-industrial control experiment, without calculating differences. The anomaly calculations are useful for variables such as tas (surface air temperature), where changes in the variable from the pre-industrial state are of most interest. The de-drifting calculations are used for variables such as cLand (total carbon in all terrestrial pools), where the absolute values are of importance, but it is also important to remove model drift before performing further analysis.
For all normalization options, we use an equation of the form:

$$v_{\mathrm{norm}}(r, t) = v_{\mathrm{raw}}(r, t) - v_{\mathrm{pi}}(r, t) \quad (2)$$

where $v_{\mathrm{norm}}(r, t)$ is the normalized value for region $r$ at time $t$, $v_{\mathrm{raw}}(r, t)$ is the raw value and $v_{\mathrm{pi}}(r, t)$ is the running-mean reference value from the pre-industrial control run (either absolute values or drift values depending on the normalization method).

| Retracted data

CMIP data are occasionally found to be erroneous and hence retracted (Balaji et al., 2018). To handle this, as part of netCDF-SCM, we provide a simple tool which checks if a data file is based on any retracted data (see https://netcdf-scm.readthedocs.io/en/latest/usage/using-cmip-data.html, last accessed 25 June 2020). In addition, the tool will also examine the data's licence and, to the extent possible, warn the user about any non-standard licence terms.
These tools take advantage of CMIP6's 'dataset-centric rather than system centric' approach (Balaji et al., 2018). The dataset-centric approach allows data users to check the validity of their data at point of use, rather than relying on the data provider to have done this for them. The dataset-centric approach also ensures that 'dark repositories' (Balaji et al., 2018), such as our derived dataset, maintain the connection between data user and the original source of the data. In our derived dataset, we maintain this connection by providing the persistent identifiers of all datasets within our metadata (specifically the tracking_id attributes).
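Conceptually, the check reduces to comparing the `tracking_id`s recorded in a derived file's metadata against a list of retracted identifiers. The sketch below supplies that list directly; in practice it would be built by querying the ESGF errata services with each persistent identifier, and the handle-style identifiers shown are hypothetical:

```python
def find_retracted(file_tracking_ids, retracted_ids):
    """Return the persistent identifiers recorded in a derived file that
    appear in a list of retracted dataset identifiers.

    The retracted set is supplied directly here for illustration; real
    code would obtain it from the ESGF errata services.
    """
    return sorted(set(file_tracking_ids) & set(retracted_ids))

# Hypothetical identifiers in the hdl-handle style used for CMIP6 tracking_ids
ids_in_file = ["hdl:21.14100/aaaa-bbbb", "hdl:21.14100/cccc-dddd"]
retracted = {"hdl:21.14100/cccc-dddd"}
flagged = find_retracted(ids_in_file, retracted)
```

A non-empty result means the derived file depends on at least one retracted source dataset and should not be used until it has been regenerated.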

| DATASET ACCESS
The dataset described in this paper is openly available at https://doi.org/10.5281/zenodo.3951890, and all our data processing code is openly available at https://gitlab.com/netcdf-scm/calibration-data. Any use of the data must follow CMIP's terms of use (see discussion in Section 2.1).
The timeseries are continuous, monthly timeseries for specific variables, experiments and regions of interest for a selection of the CMIP5 and CMIP6 archives. These timeseries are the combination of the experiment of interest, any (potentially multiple) experiments from which it 'branched' and any normalization, which is applied (Section 2.2 and Figure 2).
Our dataset is a different product from the data available in the IPCC AR6 Atlas (https://github.com/SantanderMetGroup/ATLAS, last accessed 9th October 2020). At present, the only similarity is that both our dataset and the Atlas provide surface air temperature (tas) at various regional aggregations for tier 1 experiments from CMIP5 and CMIP6. However, while the Atlas uses a binary land/ocean mask (i.e. each value is a one or a zero) and the cosine of latitude as a proxy for area weights, we use the model-reported cell areas (where available) and a continuous land fraction (including subtleties for land-only and ocean-only data, see Section 2.2.1) to calculate weighted, aggregate metrics. Secondly, we provide stitched outputs, joining each experiment with its parent, grandparent etc. experiments. Thirdly, the Atlas provides precipitation (pr) timeseries, whereas we do not; instead, we provide data for 82 other variables as described previously. Finally, the Atlas provides one ensemble member per climate model, while we provide as many ensemble members as are available.
We provide our data in a comma separated value (csv) format. This format is composed of three key parts and uses the extension .MAG because it is directly compatible with the MAGICC7 reduced complexity climate model (Meinshausen et al., 2019). The first part is the header, which contains the date the file was written, the contact for the file, the version of netCDF-SCM used to crunch the data and the version of Pymagicc (Gieseke et al., 2018) used to write the file. The second is the metadata. This contains all metadata from each of the raw datafiles, plus extra information and metadata about the method used to derive the final timeseries included in the file. It also contains a FORTRAN90 namelist with basic information about the data in the file.

FIGURE 2: Joining of timeseries ('stitching') and calculation of anomalies from the pre-industrial control runs ('normalization'). Following CMIP6 terminology, we show here a 'family' of experiments, with G6solar being the child, ssp585 the parent, historical the grandparent and piControl the great-grandparent. (a) Raw data on its native time axes; (b) raw data alongside the pre-industrial control run, with the pre-industrial control run's time axis shifted so that the branch point (i.e. the point in time at which an experiment branches from its parent, in this case the point at which the historical experiment branches from the piControl experiment) occurs at the same time in the pre-industrial control run and the historical experiment; (c) final stitched and normalized output. In this case, the stitched and normalized output comprises the historical experiment from 1850-2014, the ssp585 experiment from 2015-2019 and the G6solar experiment from 2020-2100, and has been normalized against a 21-year running mean of the pre-industrial control experiment (red line in panel b).
A particularly useful piece of information is the THISFILE_FIRSTDATAROW line, which allows automated readers to skip straight to the data if that is all they are interested in. The third and final section is the data. The data block is composed of a four-line header with variable, units and region information for each timeseries, as well as a MAGICC7-specific row, TODO, which can generally be ignored. After the header comes the data itself. The data block has column-oriented data, with the first column being the time axis (in years) and each subsequent column being a different timeseries (sometimes referred to as 'wide' data, although this term is imprecise; Wickham, 2014).
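A minimal reader for this layout might look as follows. This is a sketch based only on the description above, and the toy file is invented; in practice, Pymagicc (Gieseke et al., 2018) provides a full reader and should be preferred:

```python
def read_mag_timeseries(lines):
    """Minimal reader for the layout described above: locate the
    THISFILE_FIRSTDATAROW namelist entry, then parse the column-oriented
    data block (first column is the time axis, in years)."""
    first_data_row = None
    for line in lines:
        if "THISFILE_FIRSTDATAROW" in line:
            # namelist entries look like ' THISFILE_FIRSTDATAROW = 5 ,'
            first_data_row = int(line.split("=")[1].strip(" ,"))
            break
    if first_data_row is None:
        raise ValueError("THISFILE_FIRSTDATAROW not found")
    rows = [
        [float(value) for value in line.split()]
        for line in lines[first_data_row - 1 :]
        if line.strip()
    ]
    times = [row[0] for row in rows]
    values = [row[1:] for row in rows]
    return times, values

# Invented toy file with the structure described in the text
toy_file = [
    "&THISFILE_SPECIFICATIONS",
    " THISFILE_FIRSTDATAROW = 5 ,",
    "/",
    " YEARS  World",
    "  1850  13.9",
    "  1851  14.0",
]
times, values = read_mag_timeseries(toy_file)
```

Because THISFILE_FIRSTDATAROW gives the 1-indexed row at which the data starts, the reader never has to understand the header or metadata sections at all.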
The data archive grows as we add new CMIP6 results. An up-to-date full collection (alongside instructions for automated downloads) can be found at https://cmip6.science.unimelb.edu.au. Examples of how to use the data can be found at https://gitlab.com/netcdf-scm/calibration-data/-/tree/master/notebooks (last accessed 25 June 2020), and we encourage any users of the data to add further examples, especially in computing languages other than Python.

| DATASET USE AND REUSE
The key users of this dataset are reduced complexity climate modellers. These regional-aggregate timeseries are a key part of model calibration (see, e.g., Meinshausen et al. (2011)) and comprehensive datasets allow reduced complexity models to be validated over a wide range of experiments and output variables.
Having said this, we believe that the dataset can be useful well beyond the reduced complexity climate model community. As discussed in Section 1, processing CMIP data is an intimidating task even for expert users and not possible for those without specialist training. We hope that our aggregate dataset removes this need for specialist training, thanks to its significantly reduced data volume and text-based format. As a result, the dataset presented here may be useful to climate change researchers outside the climate modelling community, policymakers, businesses and even journalists.
The dataset presented here comes with three important caveats. The first is that we make no guarantees about how up-to-date our data are. As discussed previously, the onus is on users of the data to check for retractions before using the data (see Section 2.2.3 for discussion of our automated tool for checking such retractions). The second is that the area-weighting used (Equation (1)) is only one of many possible area-weighting choices. For example, other users may wish to partition data into ocean/land boxes based on whether the fraction of each gridbox is above some threshold or not. At present, we do not provide data for area-weighting choices beyond the one described in this paper. For users who need to do such analysis, we are able to provide guidance on how this could be done with netCDF-SCM via netCDF-SCM's issue tracker (https://gitlab.com/netcdf-scm/netcdf-scm/-/issues). Thirdly, we provide only a limited number of ways of calculating anomalies. Again, for users who wish to calculate anomalies in a different way from those we have provided, netCDF-SCM's issue tracker can be used for discussions and guidance.
On the software side, the netCDF-SCM tool is in its relative infancy and is currently developed by only a limited community. As a result, many improvements could be made. We hope that netCDF-SCM's open source nature, with its extensive tests, invites contributions from throughout the climate community and beyond. Such contributions will improve netCDF-SCM's functionality and reduce the need for duplicate effort.
As a first suggestion, we note that much of netCDF-SCM's functionality is a duplication of functionality within the ESMValTool, 'A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP' (Eyring et al., 2016b). The duplication results from the parallel development of these two projects, which were both too immature to be combined when they were begun. Adding netCDF-SCM's functionality into the ESMValTool, which is a much bigger and better supported project, would reduce this duplication and likely provide benefits for both groups.
Outside of integration with the ESMValTool, improvements could be made to netCDF-SCM's memory usage, dask usage and parallelization. These may lead to significant processing performance gains, as such optimizations, particularly the use of dask's task planning capabilities, have only been performed to a limited degree to date.

| CONCLUSIONS
We have presented a dataset of monthly, global, hemispheric and land/ocean means based on the CMIP5 and CMIP6 archives. The dataset is aimed at reduced complexity climate modellers, but may also be useful for many other researchers. Our dataset joins the different levels of experiments, reducing the need for users to manage and join multiple separate datasets before they can be used. This dataset is orders of magnitude smaller than the raw datasets themselves and hence can be managed much more easily by non-expert users. In addition, we provide the dataset in a text-based format, which removes the need for users to be familiar with the netCDF format before they can use the data. We hope this facilitates use by groups outside the science community, for example policymakers, actuaries and journalists.
We add new CMIP6 results to our dataset regularly and simultaneously remove any data derived from retracted datasets. We also provide simple tools to check whether a user's data have been retracted since they downloaded the data and to highlight any non-standard licence terms so that users can make sure they have the most up-to-date information possible.
We hope this can be a great community resource, which builds on the community efforts of the netCDF format (Unidata, 2020), Iris (Met Office, 2019) and generations of CMIP (Taylor et al., 2012;Eyring et al., 2016a).

ACKNOWLEDGEMENTS
We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modelling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF. We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modelling groups (listed in Table 1 of this paper) for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We thank the authors of Iris for all of their efforts in handling netCDF files. Without their efforts, netCDF-SCM would not have been possible. The work was undertaken in collaboration with the Melbourne Data Analytics Platform (MDAP) at the University of Melbourne. The work was also supported by Science IT at the University of Melbourne. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. ZN benefited from support provided by the ARC Centre of Excellence for Climate Extremes (CE170100023).

CONFLICTS OF INTEREST
The authors declare that they have no conflict of interest.