Internal datasets

The ETNA library contains several popular datasets that are often used in papers to evaluate the quality of time series models. To load one, choose a dataset name and use the following code:

from etna.datasets import load_dataset

ts = load_dataset(name="tourism_monthly", parts="full")

The first call takes some time because the dataset is downloaded and saved locally; subsequent calls read the data from a file. In the example above, we load the tourism dataset with monthly frequency. We also use parts="full", which means that we load the full dataset (each dataset has predefined parts to load). For more details, see the etna.datasets.load_dataset() API reference.
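The download-once, read-from-file behaviour described above can be illustrated with a minimal sketch. This is not ETNA's implementation: `load_with_cache`, `fake_fetch`, and the JSON file layout are hypothetical names and choices made only for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def load_with_cache(name: str, cache_dir: Path, fetch: Callable[[str], List[float]]) -> List[float]:
    """Return the dataset called `name`, fetching it only if no local copy exists."""
    cache_file = cache_dir / f"{name}.json"
    if cache_file.exists():
        # Later calls: read the local copy and skip the slow fetch
        return json.loads(cache_file.read_text())
    # First call: fetch the data and save it for next time
    data = fetch(name)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data


calls: List[str] = []


def fake_fetch(name: str) -> List[float]:
    """Hypothetical stand-in for a slow download."""
    calls.append(name)
    return [1.0, 2.0, 3.0]


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("tourism_monthly", Path(tmp), fake_fetch)
    second = load_with_cache("tourism_monthly", Path(tmp), fake_fetch)

print(first == second, len(calls))  # True 1
```

The second call returns the same data without invoking the fetch function again, which is why only the first load is slow.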

List of internal datasets

Datasets

| Dataset | Frequency | Shape | Time period | Exogenous data | Dataset parts |
|---|---|---|---|---|---|
| electricity_15T | 15 minutes | 140256 observations, 370 segments | (“2011-01-01 00:15:00”, “2015-01-01 00:00:00”) | No exog data | train, test, full |
| m3_monthly | monthly | 144 observations, 1428 segments | int timestamp | Original timestamp column | train, test, full |
| m3_quarterly | quarterly | 72 observations, 756 segments | int timestamp | Original timestamp column | train, test, full |
| m3_other | unknown, expected quarterly | 104 observations, 174 segments | int timestamp | Original timestamp column | train, test, full |
| m3_yearly | yearly | 47 observations, 645 segments | int timestamp | Original timestamp column | train, test, full |
| m4_hourly | hourly | 1008 observations, 414 segments | int timestamp | No exog data | train, test, full |
| m4_daily | daily | 9933 observations, 4227 segments | int timestamp | No exog data | train, test, full |
| m4_weekly | weekly | 2610 observations, 359 segments | int timestamp | No exog data | train, test, full |
| m4_monthly | monthly | 2812 observations, 48000 segments | int timestamp | No exog data | train, test, full |
| m4_quarterly | quarterly | 874 observations, 24000 segments | int timestamp | No exog data | train, test, full |
| m4_yearly | yearly | 841 observations, 23000 segments | int timestamp | No exog data | train, test, full |
| traffic_2008_10T | 10 minutes | 65520 observations, 963 segments | (“2008-01-01 00:00:00”, “2009-03-30 23:50:00”) | No exog data | train, test, full |
| traffic_2008_hourly | hourly | 10920 observations, 963 segments | (“2008-01-01 00:00:00”, “2009-03-30 23:00:00”) | No exog data | train, test, full |
| traffic_2015_hourly | hourly | 17544 observations, 862 segments | (“2015-01-01 00:00:00”, “2016-12-31 23:00:00”) | No exog data | train, test, full |
| tourism_monthly | monthly | 333 observations, 366 segments | int timestamp | Original timestamp column | train, test, full |
| tourism_quarterly | quarterly | 130 observations, 427 segments | int timestamp | Original timestamp column | train, test, full |
| tourism_yearly | yearly | 47 observations, 518 segments | int timestamp | Original timestamp column | train, test, full |
| weather_10T | 10 minutes | 52704 observations, 21 segments | (“2020-01-01 00:10:00”, “2021-01-01 00:00:00”) | No exog data | train, test, full |
| ETTm1 | 15 minutes | 69680 observations, 7 segments | (“2016-07-01 00:00:00”, “2018-06-26 19:45:00”) | No exog data | train, test, full |
| ETTm2 | 15 minutes | 69680 observations, 7 segments | (“2016-07-01 00:00:00”, “2018-06-26 19:45:00”) | No exog data | train, test, full |
| ETTh1 | hourly | 17420 observations, 7 segments | (“2016-07-01 00:00:00”, “2018-06-26 19:00:00”) | No exog data | train, test, full |
| ETTh2 | hourly | 17420 observations, 7 segments | (“2016-07-01 00:00:00”, “2018-06-26 19:00:00”) | No exog data | train, test, full |
| IHEPC_T | minute | 2075259 observations, 7 segments | (“2006-12-16 17:24:00”, “2010-11-26 21:02:00”) | No exog data | full |
| australian_wine_sales_monthly | monthly | 176 observations, 1 segment | (“1980-01-01 00:00:00”, “1994-08-01 00:00:00”) | No exog data | full |
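For datasets with a datetime index, the number of observations in the Shape column follows from the time period and the frequency. The sketch below checks this for three of the datasets listed above. `count_observations` is a hypothetical helper written for this illustration, not part of ETNA, and it assumes a gap-free, regularly spaced index.

```python
from datetime import datetime, timedelta


def count_observations(start: str, end: str, step: timedelta) -> int:
    """Count regularly spaced timestamps from `start` to `end`, both included."""
    fmt = "%Y-%m-%d %H:%M:%S"
    span = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if span % step != timedelta(0):
        raise ValueError("time period is not a whole number of steps")
    return span // step + 1


# electricity_15T
print(count_observations("2011-01-01 00:15:00", "2015-01-01 00:00:00", timedelta(minutes=15)))  # 140256
# weather_10T
print(count_observations("2020-01-01 00:10:00", "2021-01-01 00:00:00", timedelta(minutes=10)))  # 52704
# ETTh1
print(count_observations("2016-07-01 00:00:00", "2018-06-26 19:00:00", timedelta(hours=1)))  # 17420
```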

electricity dataset

The electricity dataset is a 15-minute time series of electricity consumption (in kW) for 370 customers. It is available in three parts: train, test, and full.

Loading names:

  • electricity_15T with parts: train (139896 observations), test (360 observations), full (140256 observations)

m3 dataset

The M3 dataset is a collection of 3,003 time series used for the third edition of the Makridakis forecasting competition. It consists of yearly, quarterly, monthly and other time series. The series in the other group originally have no particular frequency, but we treat them as quarterly data. Each frequency mode has its own prediction horizon: 6 for yearly, 8 for quarterly, 18 for monthly, and 8 for other.

The M3 dataset has series ending on different dates. Because of the way TSDataset works, we use an integer index so that all series end on the same timestamp. The original dates are added as exogenous data. For example, df_exog of the train dataset has dates for both train and test, while df_exog of the test dataset has dates only for test.
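The idea of replacing dates with an integer index can be sketched as follows. This is an illustration under assumptions, not ETNA's code: the toy data, the `right_align` helper, and the choice to start the index at 0 are all hypothetical. The only property taken from the text is that every series ends on the same integer timestamp while the original dates are kept alongside.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: each segment holds (date, value) pairs and ends on a different date
raw: Dict[str, List[Tuple[str, float]]] = {
    "segment_a": [("1990-01", 10.0), ("1990-02", 11.0), ("1990-03", 12.0)],
    "segment_b": [("1992-05", 5.0), ("1992-06", 6.0)],
}


def right_align(
    series: Dict[str, List[Tuple[str, float]]]
) -> Tuple[Dict[str, Dict[int, float]], Dict[str, Dict[int, str]]]:
    """Re-index every segment with integers so that all segments end on the same index."""
    last_index = max(len(points) for points in series.values()) - 1
    target: Dict[str, Dict[int, float]] = {}
    exog: Dict[str, Dict[int, str]] = {}
    for segment, points in series.items():
        # Shorter series start later so that their final point lands on `last_index`
        start = last_index - len(points) + 1
        target[segment] = {start + i: value for i, (_, value) in enumerate(points)}
        # The original dates are kept under the same integer index
        exog[segment] = {start + i: date for i, (date, _) in enumerate(points)}
    return target, exog


target, exog = right_align(raw)
print(target["segment_b"])  # {1: 5.0, 2: 6.0}
print(exog["segment_b"])  # {1: '1992-05', 2: '1992-06'}
```

Both segments now end at index 2, and the date of any observation can still be looked up through the exogenous mapping.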

Loading names:

  • m3_monthly with parts: train (126 observations), test (18 observations), full (144 observations)

  • m3_quarterly with parts: train (64 observations), test (8 observations), full (72 observations)

  • m3_yearly with parts: train (41 observations), test (6 observations), full (47 observations)

  • m3_other with parts: train (96 observations), test (8 observations), full (104 observations)

m4 dataset

The M4 dataset is a collection of 100,000 time series used for the fourth edition of the Makridakis forecasting Competition. The M4 dataset consists of time series of yearly, quarterly, monthly and other (weekly, daily and hourly) data. Each frequency mode has its own specific prediction horizon: 6 for yearly, 8 for quarterly, 18 for monthly, 13 for weekly, 14 for daily and 48 for hourly.

Loading names:

  • m4_hourly with parts: train (960 observations), test (48 observations), full (1008 observations)

  • m4_daily with parts: train (9919 observations), test (14 observations), full (9933 observations)

  • m4_weekly with parts: train (2597 observations), test (13 observations), full (2610 observations)

  • m4_monthly with parts: train (2794 observations), test (18 observations), full (2812 observations)

  • m4_quarterly with parts: train (866 observations), test (8 observations), full (874 observations)

  • m4_yearly with parts: train (835 observations), test (6 observations), full (841 observations)
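In every entry above, the test part has exactly as many observations as the prediction horizon, and train plus test gives full. A minimal sketch of such a split, assuming the test part is simply the last `horizon` observations of a series (`split_by_horizon` is a hypothetical helper, not an ETNA function):

```python
from typing import List, Sequence, Tuple

# Horizons quoted above for the M4 frequency modes
M4_HORIZONS = {"yearly": 6, "quarterly": 8, "monthly": 18, "weekly": 13, "daily": 14, "hourly": 48}


def split_by_horizon(values: Sequence[float], horizon: int) -> Tuple[List[float], List[float]]:
    """Hold out the last `horizon` observations as test; the rest is train."""
    if not 0 < horizon < len(values):
        raise ValueError("horizon must be positive and shorter than the series")
    return list(values[:-horizon]), list(values[-horizon:])


full = list(range(1008))  # same length as the full part of m4_hourly
train, test = split_by_horizon(full, M4_HORIZONS["hourly"])
print(len(train), len(test))  # 960 48
```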

traffic 2008 dataset

The dataset contains 15 months of daily data (440 daily records) describing the occupancy rate, between 0 and 1, of different car lanes on San Francisco Bay Area freeways over time. The data was collected by 963 sensors from Jan. 1st 2008 to Mar. 30th 2009. 15 days were dropped from this period (public holidays and two days with anomalies); we set zero values for these days. The initial dataset has a 10-minute frequency; we create the hourly version by mean aggregation. Each frequency mode has its own prediction horizon: 6 * 24 = 144 for 10T and 24 for hourly.
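Mean aggregation from a 10-minute to an hourly frequency averages each run of six consecutive values. A minimal sketch on made-up occupancy rates (`aggregate_to_hourly` is a hypothetical helper; how ETNA performs the aggregation internally is not shown here):

```python
from statistics import mean
from typing import List, Sequence


def aggregate_to_hourly(values_10min: Sequence[float]) -> List[float]:
    """Average each run of six 10-minute values into one hourly value."""
    steps_per_hour = 6
    if len(values_10min) % steps_per_hour != 0:
        raise ValueError("expected a whole number of hours")
    return [
        mean(values_10min[i : i + steps_per_hour])
        for i in range(0, len(values_10min), steps_per_hour)
    ]


# Two hours of made-up occupancy rates
occupancy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6]
hourly = aggregate_to_hourly(occupancy)
print(len(hourly))  # 2

# Six 10-minute steps per hour also explains the sizes of the two versions
print(65520 // 6)  # 10920
```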

Loading names:

  • traffic_2008_10T with parts: train (65376 observations), test (144 observations), full (65520 observations)

  • traffic_2008_hourly with parts: train (10896 observations), test (24 observations), full (10920 observations)

traffic 2015 dataset

The dataset contains 24 months of hourly data (24 records per day) describing the occupancy rate, between 0 and 1, of different car lanes on San Francisco Bay Area freeways over time. The data was collected by 862 sensors from Jan. 1st 2015 to Dec. 31st 2016. The prediction horizon is 24.

Loading names:

  • traffic_2015_hourly with parts: train (17520 observations), test (24 observations), full (17544 observations)

tourism dataset

The dataset contains 1311 series in three frequency modes: monthly, quarterly, yearly. They were supplied by both tourism bodies (such as Tourism Australia, the Hong Kong Tourism Board and Tourism New Zealand) and various academics, who had used them in previous tourism forecasting studies. Each frequency mode has its own prediction horizon: 4 for yearly, 8 for quarterly, 24 for monthly.

The tourism dataset has series ending on different dates. Because of the way TSDataset works, we use an integer index so that all series end on the same timestamp. The original dates are added as exogenous data. For example, df_exog of the train dataset has dates for both train and test, while df_exog of the test dataset has dates only for test.

Loading names:

  • tourism_monthly with parts: train (309 observations), test (24 observations), full (333 observations)

  • tourism_quarterly with parts: train (122 observations), test (8 observations), full (130 observations)

  • tourism_yearly with parts: train (43 observations), test (4 observations), full (47 observations)

weather dataset

The dataset contains 21 meteorological indicators recorded in Germany, such as humidity and air temperature, at a 10-minute frequency for 2020. We use the last 24 hours as the prediction horizon.

Loading names:

  • weather_10T with parts: train (52560 observations), test (144 observations), full (52704 observations)

Electricity Transformer Datasets (ETT)

The dataset consists of four parts: ETTh1 (hourly frequency), ETTh2 (hourly frequency), ETTm1 (15-minute frequency), ETTm2 (15-minute frequency). It is a collection of two years of data from two regions of a province of China. There is one target column (“oil temperature”) and six different types of external power load features. We use the last 720 hours as the prediction horizon.

Loading names:

  • ETTm1 with parts: train (66800 observations), test (2880 observations), full (69680 observations)

  • ETTm2 with parts: train (66800 observations), test (2880 observations), full (69680 observations)

  • ETTh1 with parts: train (16700 observations), test (720 observations), full (17420 observations)

  • ETTh2 with parts: train (16700 observations), test (720 observations), full (17420 observations)
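The test sizes above follow from converting the 720-hour horizon into a number of observations at each frequency. A small worked example (`horizon_in_steps` is a hypothetical helper used only for this arithmetic):

```python
def horizon_in_steps(horizon_hours: int, step_minutes: int) -> int:
    """Number of observations that cover `horizon_hours` at the given sampling step."""
    if (horizon_hours * 60) % step_minutes != 0:
        raise ValueError("horizon is not a whole number of steps")
    return horizon_hours * 60 // step_minutes


print(horizon_in_steps(720, 60))  # 720, the test size of ETTh1 and ETTh2
print(horizon_in_steps(720, 15))  # 2880, the test size of ETTm1 and ETTm2
```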

Individual household electric power consumption dataset

This dataset consists of almost 4 years of history with 1 minute frequency from a household in Sceaux. Different electrical quantities and some sub-metering values are available.

Loading names:

  • IHEPC_T with parts: full (2075259 observations)

Australian wine sales dataset

This dataset consists of wine sales by Australian wine makers between Jan 1980 and Aug 1994.

Loading names:

  • australian_wine_sales_monthly with parts: full (176 observations)
