Tutorial: Data Management
Author: Tianyu Du (tianyudu@stanford.edu)
Note: please go through the introduction tutorial here before proceeding.
This notebook aims to help users understand the functionality of the ChoiceDataset object.
The ChoiceDataset is a subclass of the more general PyTorch dataset object and holds information on consumer choices. The ChoiceDataset offers easy, clean and efficient data management. The Jupyter-notebook version of this tutorial can be found here.
This tutorial provides in-depth explanations of how the torch-choice library manages data. We also provide an easy-to-use data wrapper that converts long-format datasets to ChoiceDataset here; with it, you can harness the torch-choice library without going through this tutorial.
Note: since this package was initially proposed for modelling consumer choices, attribute names of ChoiceDataset are borrowed from the consumer choice literature.
Note: PyTorch uses the term tensor to denote high-dimensional matrices; we use tensor and matrix interchangeably.
After walking through this tutorial, you should be able to initialize a ChoiceDataset object as follows and use it to manage data.
dataset = ChoiceDataset(
    # pre-specified keywords of __init__
    item_index=item_index,  # required.
    # optional:
    user_index=user_index,
    session_index=session_index,
    item_availability=item_availability,
    # additional keywords of __init__
    user_obs=user_obs,
    item_obs=item_obs,
    session_obs=session_obs,
    price_obs=price_obs)
Observables
Observables are tensors with specific shapes; we classify observables into four categories based on what they vary with: users, items, sessions, or session-item pairs.
Basic Usage
Optionally, the researcher can incorporate observables of, for example, users and items. Currently, the package supports the following types of observables, where \(K_{...}\) denotes the number of observables of each type.
user_obs \(\in \mathbb{R}^{U\times K_{user}}\): user observables such as user age.
item_obs \(\in \mathbb{R}^{I\times K_{item}}\): item observables such as item quality.
session_obs \(\in \mathbb{R}^{S \times K_{session}}\): session observables such as whether the purchase was made on a weekday.
price_obs \(\in \mathbb{R}^{S \times I \times K_{price}}\): price observables, which depend on both session and item, such as the price of an item.
The researcher should supply them as the appropriate keyword arguments while constructing the ChoiceDataset object.
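As a small illustration of the shapes listed above, the following sketch builds one tensor of each type with plain PyTorch. The sizes and the names U, I, S and K_* are arbitrary choices made for this example only.

```python
import torch

# Hypothetical sizes chosen only for illustration.
U, I, S = 3, 2, 5                               # users, items, sessions
K_user, K_item, K_session, K_price = 4, 3, 2, 1

user_obs = torch.randn(U, K_user)               # one row per user
item_obs = torch.randn(I, K_item)               # one row per item
session_obs = torch.randn(S, K_session)         # one row per session
price_obs = torch.randn(S, I, K_price)          # one row per (session, item) pair

shapes = {name: tuple(t.shape) for name, t in [
    ('user_obs', user_obs), ('item_obs', item_obs),
    ('session_obs', session_obs), ('price_obs', price_obs)]}
print(shapes)
```

Only price_obs is three-dimensional, because it is the only type that varies with both the session and the item.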
(Optional) Advanced Usage: Additional Observables
In some cases, the researcher has multiple sets of user (or item, or session, or price) observables, say user income (a scalar variable) and user market membership. The user income is a matrix in \(\mathbb{R}^{U\times 1}\). Further, suppose there are four types of market membership: no-membership, silver-membership, gold-membership, and diamond-membership. The user market membership is a binary matrix in \(\{0, 1\}^{U\times 4}\) if we one-hot encode users' membership status.
In this case, the researcher can either
1. concatenate user_income and user_market_membership into a \(\mathbb{R}^{U\times (1+4)}\) matrix and supply it as a single user_obs, or
2. supply a user_income \(\in \mathbb{R}^{U \times 1}\) matrix and a user_market_membership \(\in \mathbb{R}^{U \times 4}\) matrix separately as follows:
dataset = ChoiceDataset(..., user_income=user_income, user_market_membership=user_market_membership, ...)
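The two options can be sketched with plain PyTorch as follows. The number of users, the income values and the membership codes are made up for illustration; only the tensor shapes matter.

```python
import torch
import torch.nn.functional as F

U = 6  # hypothetical number of users
user_income = torch.rand(U, 1) * 100                       # shape (U, 1)

# Hypothetical membership codes: 0=none, 1=silver, 2=gold, 3=diamond.
membership_code = torch.tensor([0, 1, 2, 3, 0, 2])
user_market_membership = F.one_hot(membership_code, num_classes=4).float()  # (U, 4)

# Option 1: one concatenated tensor, to be passed as user_obs.
user_obs = torch.cat([user_income, user_market_membership], dim=1)  # (U, 1 + 4)

# Option 2: keep the tensors separate and pass each under its own keyword,
# i.e. user_income=user_income, user_market_membership=user_market_membership.
print(user_obs.shape)
```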
Supplying two separate sets of observables is particularly useful when the researcher wants different kinds of coefficients for different kinds of observables.
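For example, the researcher wishes to model the utility for user \(u\) to purchase item \(i\) in session \(s\) as the following:
\[\mathcal{U}_{usi} = \beta_i X^{\text{user income}}_u + \gamma^\top X^{\text{user market membership}}_u + \varepsilon_{usi}\]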
Please note that the \(\beta_i\) coefficient has an \(i\) subscript, which means it's item specific. The \(\gamma\) coefficient has no subscript, which means it's the same for all items.
The coefficient for user income is item-specific so that it captures the nature of the product (i.e., whether it is a luxury or an essential good). Additionally, the utility representation admits a user market membership term because shoppers with active memberships tend to purchase more, and the coefficient of this term is constant across all items.
As we will cover later in the modelling section, we need to supply two user observable tensors in this case for the model to build coefficients with different levels of variation (i.e., item-specific coefficients versus constant coefficients). In this case, the researcher needs to supply two tensors, user_income and user_market_membership, as keyword arguments to the ChoiceDataset constructor.
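To make the difference between item-specific and constant coefficients concrete, here is a minimal sketch of the two coefficient terms described above, written with plain PyTorch broadcasting. It does not use the model classes of torch-choice, it ignores the session index and any error term, and all sizes and values are hypothetical.

```python
import torch

torch.manual_seed(0)
U, I = 5, 3  # hypothetical numbers of users and items

user_income = torch.randn(U, 1)                                    # (U, 1)
user_market_membership = torch.eye(4)[torch.randint(0, 4, (U,))]   # one-hot, (U, 4)

beta = torch.randn(I)    # item-specific income coefficients, one per item
gamma = torch.randn(4)   # membership coefficients shared by all items

# beta_i * income_u for every (user, item) pair: (U, 1) * (1, I) -> (U, I).
income_term = user_income * beta.view(1, I)
# The membership term does not depend on the item: (U,) -> (U, 1),
# which broadcasts over the item axis in the sum below.
membership_term = (user_market_membership @ gamma).view(U, 1)

utility = income_term + membership_term   # (U, I)
print(utility.shape)
```

Because the membership term is the same for every item, the difference in utility between two items for a given user depends only on income and the item-specific coefficients.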
Generally, the ChoiceDataset handles multiple user/item/session/price observables internally; the ChoiceDataset class identifies the variation of observables by their prefixes. For example, every keyword argument passed into ChoiceDataset with a name starting with item_ (except for the reserved item_availability) will be treated as an item observable tensor.
Similarly, all keywords with names starting with user_, session_ and price_ (except for reserved names like user_index and session_index mentioned above) will be interpreted as user/session/price observable tensors.
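The prefix rule can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the library's implementation; the function name group_observables and the constants are invented for this example.

```python
# Names that are keywords of __init__ rather than observables.
RESERVED = {'item_index', 'user_index', 'session_index', 'item_availability'}
PREFIXES = ('user_', 'item_', 'session_', 'price_')

def group_observables(**kwargs):
    """Group keyword names by prefix, skipping the reserved names."""
    groups = {prefix: [] for prefix in PREFIXES}
    for name in kwargs:
        if name in RESERVED:
            continue
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
    return groups

groups = group_observables(
    item_index=None, user_index=None, item_availability=None,
    user_income=None, user_market_membership=None,
    item_obs=None, price_obs=None)
print(groups)
```

Here user_income and user_market_membership both land in the user_ group, while item_availability is skipped even though it starts with item_.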
# import required dependencies.
import numpy as np
import pandas as pd
import torch
from torch_choice.data import ChoiceDataset, JointDataset
# helper that prints the shape of every tensor stored in a dictionary.
def print_dict_shape(d):
    for key, val in d.items():
        if torch.is_tensor(val):
            print(f'dict.{key}.shape={val.shape}')
Creating ChoiceDataset Object
# Feel free to modify it as you want.
num_users = 10
num_items = 4
num_sessions = 500
length_of_dataset = 10000
Step 1: Generate some random purchase records and observables
We will be creating a randomly generated dataset with 10000 purchase records from 10 users, 4 items and 500 sessions.
We use the term purchase record for a row of the dataset, rather than observation, because observation means something else in the Stata documentation and we do not want to confuse existing Stata users.
As mentioned in the introduction tutorial, one purchase record consists of who (i.e., user) bought what (i.e., item) when and where (i.e., session).
The length of the dataset equals the number of purchase records in it.
The first step is to randomly generate the observables and the purchase records using the following code. For simplicity, we assume all items are available in all sessions.
# create observables/features, the number of features is arbitrarily chosen.
# generate 128 features for each user, e.g., race, gender.
user_obs = torch.randn(num_users, 128)
# generate 64 features for each item, e.g., quality.
item_obs = torch.randn(num_items, 64)
# generate 10 features for each session, e.g., weekday indicator.
session_obs = torch.randn(num_sessions, 10)
# generate 12 features for each session-item pair, e.g., the price of that item in that session.
price_obs = torch.randn(num_sessions, num_items, 12)
The observable tensors for users, items, sessions and prices above are randomly generated; the size of each type of observable (i.e., the last dimension of its shape) is arbitrarily chosen.
Notes on Encodings
Since we will be using PyTorch to train our model, we represent item identities with consecutive integer values instead of the raw human-readable names of items (e.g., Dell 24-inch LCD monitor). Similarly, you need to encode user indices and session indices as well. Raw item names can be encoded easily with sklearn.preprocessing.LabelEncoder (sklearn.preprocessing.OrdinalEncoder works as well).
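The idea behind such an encoding can be sketched without any extra dependency: each distinct name is assigned a consecutive integer. The item names below, other than the monitor, are made up for illustration.

```python
raw_items = ['Dell 24-inch LCD monitor', 'USB-C cable', 'Dell 24-inch LCD monitor',
             'Wireless mouse', 'USB-C cable']

# Map each distinct name to a consecutive integer, in sorted order.
classes = sorted(set(raw_items))
name_to_code = {name: code for code, name in enumerate(classes)}
encoded = [name_to_code[name] for name in raw_items]

print(classes)
print(encoded)  # [0, 1, 0, 2, 1]
```

The resulting list of integers can then be wrapped in torch.LongTensor to obtain an item_index tensor; keeping classes around lets you map codes back to names later.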
item_index = torch.LongTensor(np.random.choice(num_items, size=length_of_dataset))
user_index = torch.LongTensor(np.random.choice(num_users, size=length_of_dataset))
session_index = torch.LongTensor(np.random.choice(num_sessions, size=length_of_dataset))
# assume all items are available in all sessions.
item_availability = torch.ones(num_sessions, num_items).bool()
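When some items are not offered in some sessions, the same boolean tensor can be edited entry by entry. The sketch below uses small hypothetical sizes and marks one item as unavailable in two sessions.

```python
import torch

n_sessions, n_items = 5, 4  # small hypothetical sizes

# Start from "everything is available" ...
availability = torch.ones(n_sessions, n_items, dtype=torch.bool)
# ... then mark item 3 as unavailable in sessions 0 and 1.
availability[:2, 3] = False

# Number of items available in each session.
available_per_session = availability.sum(dim=1)
print(available_per_session)  # tensor([3, 3, 4, 4, 4])
```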
Step 2: Initialize the ChoiceDataset.
You can construct a ChoiceDataset using the following code, which manages all information for you.
dataset = ChoiceDataset(
    # pre-specified keywords of __init__
    item_index=item_index,  # required.
    # optional:
    user_index=user_index,
    session_index=session_index,
    item_availability=item_availability,
    # additional keywords of __init__
    user_obs=user_obs,
    item_obs=item_obs,
    session_obs=session_obs,
    price_obs=price_obs)
What you can do with the ChoiceDataset?
print(dataset) and dataset.__str__
The command print(dataset) provides a quick overview of the shapes of the tensors included in the object as well as where the dataset is located (i.e., host memory or GPU memory).
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
dataset.summary()
The summary method provides a preliminary summary of the dataset.
4 1038
8 1035
5 1024
1 1010
2 997
0 990
6 981
9 980
3 974
7 971
dtype: int64
0 2575
1 2539
2 2467
3 2419
dtype: int64
ChoiceDataset with 500 sessions, 4 items, 10 users, 10000 purchase records (observations) .
The most frequent user is 4 with 1038 observations; the least frequent user is 7 with 971 observations; on average, there are 1000.00 observations per user.
5 most frequent users are: 4(1038 times), 8(1035 times), 5(1024 times), 1(1010 times), 2(997 times).
5 least frequent users are: 7(971 times), 3(974 times), 9(980 times), 6(981 times), 0(990 times).
The most frequent item is 0, it was chosen 2575 times; the least frequent item is 3 it was 2419 times; on average, each item was purchased 2500.00 times.
4 most frequent items are: 0(2575 times), 1(2539 times), 2(2467 times), 3(2419 times).
4 least frequent items are: 3(2419 times), 2(2467 times), 1(2539 times), 0(2575 times).
Attribute Summaries:
Observable Tensor 'user_obs' with shape torch.Size([10, 128])
0 1 2 3 4 5 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.687878 -0.339077 -0.375829 0.086242 0.250604 -0.344643
std 0.738520 1.259936 0.844018 0.766233 0.802785 0.645239
min -0.578577 -2.135251 -1.335928 -0.911508 -1.396776 -1.519729
25% 0.264708 -0.889820 -0.845100 -0.414891 -0.132619 -0.699887
50% 0.902505 -0.603065 -0.638757 -0.289223 0.297693 -0.405371
75% 1.155211 0.021188 -0.190907 0.712183 0.768554 0.117107
max 1.623162 2.217712 1.624211 1.252059 1.273116 0.571998
6 7 8 9 ... 118 119 \
count 10.000000 10.000000 10.000000 10.000000 ... 10.000000 10.000000
mean 0.423672 0.325855 0.258114 -0.199072 ... -0.165618 -0.378175
std 1.304160 0.815934 0.938925 1.344848 ... 1.135625 0.940863
min -1.440672 -1.068176 -1.280547 -2.819688 ... -1.567793 -1.604171
25% -0.535055 0.051598 -0.178302 -0.801871 ... -1.114392 -1.066492
50% 0.502826 0.369002 0.230939 -0.576039 ... -0.114789 -0.587483
75% 1.227700 0.899518 0.740881 0.820789 ... 0.602045 0.160254
max 2.462891 1.440098 1.828760 1.866570 ... 1.854828 1.386001
120 121 122 123 124 125 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean -0.557321 0.402392 -0.070746 -0.770201 0.594842 0.572671
std 1.128886 0.899030 0.757537 1.044478 0.956856 0.883374
min -3.131332 -0.907885 -1.296398 -2.159384 -1.244177 -0.462607
25% -0.834223 -0.059528 -0.222124 -1.332558 0.234198 -0.008799
50% -0.613761 0.117478 -0.109676 -0.984450 0.656855 0.466357
75% 0.040239 1.136383 0.416972 -0.285216 1.246513 0.772441
max 1.087999 1.757588 1.022053 1.486507 2.010775 2.162550
126 127
count 10.000000 10.000000
mean 0.226993 -0.064205
std 1.463179 0.602277
min -1.731004 -0.865115
25% -0.951169 -0.418553
50% 0.174763 -0.112277
75% 0.773072 0.353951
max 2.991696 0.804881
[8 rows x 128 columns]
Observable Tensor 'item_obs' with shape torch.Size([4, 64])
0 1 2 3 4 5 6 \
count 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
mean 0.287015 -0.180256 -0.239000 0.169168 0.159036 0.385342 -1.142672
std 1.339318 1.603530 0.722772 0.473407 0.392562 1.327739 0.566069
min -1.138152 -2.212473 -1.051363 -0.538771 -0.330795 -0.517352 -1.770297
25% -0.558802 -0.990083 -0.745828 0.132031 -0.006671 -0.485835 -1.397787
50% 0.170810 -0.012201 -0.154058 0.385432 0.174086 -0.125969 -1.199654
75% 1.016628 0.797626 0.352770 0.422569 0.339793 0.745208 -0.944538
max 1.944591 1.515852 0.403479 0.444577 0.618768 2.310656 -0.401083
7 8 9 ... 54 55 56 \
count 4.000000 4.000000 4.000000 ... 4.000000 4.000000 4.000000
mean 0.581071 -0.169341 0.076562 ... 0.055457 -0.002887 -0.160406
std 0.972295 0.978922 1.116274 ... 0.777132 0.903879 1.140101
min -0.596834 -1.309131 -1.563906 ... -0.481757 -0.997574 -1.721709
25% -0.025344 -0.718815 -0.153971 ... -0.442894 -0.340660 -0.631280
50% 0.745386 -0.177989 0.514336 ... -0.240767 -0.105541 0.117918
75% 1.351801 0.371485 0.744870 ... 0.257583 0.232232 0.588793
max 1.430348 0.987744 0.841483 ... 1.185118 1.197110 0.844249
57 58 59 60 61 62 63
count 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
mean 0.149579 0.199678 0.088542 -0.356379 1.004674 0.095064 -0.548665
std 0.963564 0.744614 1.170228 0.833992 0.559029 0.912057 0.730697
min -0.760765 -0.419252 -1.038935 -0.989042 0.442226 -0.989018 -1.445138
25% -0.268040 -0.383280 -0.604213 -0.970008 0.592259 -0.492793 -0.790356
50% -0.075941 0.036190 -0.142981 -0.611959 0.966522 0.230826 -0.546745
75% 0.341678 0.619148 0.549774 0.001670 1.378937 0.818683 -0.305054
max 1.510964 1.145585 1.679067 0.787444 1.643426 0.907622 0.343970
[8 rows x 64 columns]
Observable Tensor 'session_obs' with shape torch.Size([500, 10])
0 1 2 3 4 5 \
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.025211 -0.018355 -0.002907 0.091295 -0.061911 -0.046364
std 0.976283 1.029875 0.959884 0.968500 1.020114 1.010222
min -2.642895 -3.091050 -3.572037 -2.406249 -3.147900 -3.357277
25% -0.745162 -0.685578 -0.636044 -0.629955 -0.754234 -0.732924
50% -0.018775 0.017807 -0.018642 0.112322 -0.090321 -0.070502
75% 0.652438 0.646001 0.601829 0.722870 0.640275 0.652521
max 3.044069 3.191774 2.521059 2.695970 3.166039 2.714594
6 7 8 9
count 500.000000 500.000000 500.000000 500.000000
mean 0.000907 0.001370 0.070499 -0.007936
std 1.015561 1.032878 1.036212 0.936091
min -2.677915 -3.489751 -2.953354 -2.424499
25% -0.679291 -0.671086 -0.582997 -0.681405
50% 0.002569 -0.009368 0.087901 0.010856
75% 0.703671 0.732814 0.737692 0.618773
max 2.528283 3.259835 2.827300 2.492085
Observable Tensor 'price_obs' with shape torch.Size([500, 4, 12])
device=cpu
dataset.num_{users, items, sessions}
You can use the num_{users, items, sessions} attributes to obtain the number of users, items, and sessions; they are determined automatically from the {user, item, session}_obs tensors provided while initializing the dataset object.
Note: the = specifier in f-strings (written as =: below) requires Python 3.8 or higher; you can remove =: if you are using an earlier version of Python.
print(f'{dataset.num_users=:}')
print(f'{dataset.num_items=:}')
print(f'{dataset.num_sessions=:}')
print(f'{len(dataset)=:}')
dataset.num_users=10
dataset.num_items=4
dataset.num_sessions=500
len(dataset)=10000
dataset.clone()
The ChoiceDataset offers a clone method that allows you to make a copy of the dataset; you can modify the cloned dataset arbitrarily without changing the original dataset.
# clone
print(dataset.item_index[:10])
dataset_cloned = dataset.clone()
# overwrite every entry with 99, keeping the original shape and integer dtype.
dataset_cloned.item_index = torch.full_like(dataset_cloned.item_index, 99)
print(dataset_cloned.item_index[:10])
print(dataset.item_index[:10]) # does not change the original dataset.
tensor([2, 2, 3, 1, 3, 2, 2, 1, 0, 1])
tensor([99, 99, 99, 99, 99, 99, 99, 99, 99, 99])
tensor([2, 2, 3, 1, 3, 2, 2, 1, 0, 1])
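The difference between a copy and a second reference can be seen with plain tensors. The sketch below assumes that cloning a dataset copies its tensors in the same spirit as torch.Tensor.clone; it does not call the ChoiceDataset API.

```python
import torch

original = {'item_index': torch.tensor([2, 2, 3, 1])}

# Plain assignment only creates a second reference to the same tensor.
alias = original['item_index']
# Tensor.clone() allocates new memory holding the same values.
copy = original['item_index'].clone()

alias[0] = 99   # visible through `original`, because the memory is shared
copy[1] = 77    # not visible through `original`
print(original['item_index'])  # tensor([99,  2,  3,  1])
print(copy)                    # tensor([ 2, 77,  3,  1])
```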
dataset.to('cuda') and dataset._check_device_consistency()
One key advantage of torch_choice and bemb is their compatibility with GPUs; you can easily move tensors in a ChoiceDataset object between host memory (i.e., CPU memory) and device memory (i.e., GPU memory) using the dataset.to() method.
Please note that the following code runs only if your machine has a compatible GPU and a GPU-compatible version of PyTorch installed.
Similarly, one can move data to host memory using dataset.to('cpu').
The dataset also provides a dataset._check_device_consistency() method to check whether all tensors are on the same device.
If we move only item_index to the CPU without moving the other tensors, the check raises an exception.
# move to device
print(f'{dataset.device=:}')
print(f'{dataset.item_index.device=:}')
print(f'{dataset.user_index.device=:}')
print(f'{dataset.session_index.device=:}')
dataset = dataset.to('cuda')
print(f'{dataset.device=:}')
print(f'{dataset.item_index.device=:}')
print(f'{dataset.user_index.device=:}')
print(f'{dataset.session_index.device=:}')
dataset.device=cpu
dataset.item_index.device=cpu
dataset.user_index.device=cpu
dataset.session_index.device=cpu
dataset.device=cuda:0
dataset.item_index.device=cuda:0
dataset.user_index.device=cuda:0
dataset.session_index.device=cuda:0
# # NOTE: this cell will result errors, this is intentional.
dataset.item_index = dataset.item_index.to('cpu')
dataset._check_device_consistency()
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-56-40d626c6d436> in <module>
1 # # NOTE: this cell will result errors, this is intentional.
2 dataset.item_index = dataset.item_index.to('cpu')
----> 3 dataset._check_device_consistency()
~/Development/torch-choice/torch_choice/data/choice_dataset.py in _check_device_consistency(self)
180 devices.append(val.device)
181 if len(set(devices)) > 1:
--> 182 raise Exception(f'Found tensors on different devices: {set(devices)}.',
183 'Use dataset.to() method to align devices.')
184
Exception: ("Found tensors on different devices: {device(type='cuda', index=0), device(type='cpu')}.", 'Use dataset.to() method to align devices.')
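The logic visible in the traceback above (collect the devices, raise if there is more than one) can be reproduced on a plain dictionary of tensors. This is a simplified sketch rather than the library's method; the function name is hypothetical, and the 'meta' device stands in for a GPU so that the example runs on a machine without one.

```python
import torch

def check_device_consistency(tensors):
    """Raise if the tensors in the dictionary live on more than one device."""
    devices = {val.device for val in tensors.values() if torch.is_tensor(val)}
    if len(devices) > 1:
        raise Exception(f'Found tensors on different devices: {devices}.',
                        'Use dataset.to() method to align devices.')

consistent = {'item_index': torch.zeros(3), 'user_index': torch.zeros(3)}
# One tensor is placed on the 'meta' device to simulate a device mismatch.
mixed = {'item_index': torch.zeros(3),
         'user_index': torch.zeros(3, device='meta')}

check_device_consistency(consistent)   # passes silently
try:
    check_device_consistency(mixed)
    raised = False
except Exception:
    raised = True
print(raised)  # True
```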
# create dictionary inputs for model.forward()
# collapse to a dictionary object.
print_dict_shape(dataset.x_dict)
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.price_obs.shape=torch.Size([10000, 4, 12])
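Every entry of x_dict printed above has shape (number of purchase records, number of items, number of features). One way to obtain such shapes from the compact tensors is index lookup followed by expansion along the item axis. The sketch below shows this with plain PyTorch and small hypothetical sizes; it illustrates the shapes and is not taken from the library's source.

```python
import torch

N, U, I, S = 6, 3, 4, 2   # records, users, items, sessions (hypothetical sizes)
user_index = torch.randint(0, U, (N,))
session_index = torch.randint(0, S, (N,))

user_obs = torch.randn(U, 5)
item_obs = torch.randn(I, 7)
session_obs = torch.randn(S, 2)
price_obs = torch.randn(S, I, 3)

expanded = {
    # Look up each record's user, then repeat along the item axis.
    'user_obs': user_obs[user_index].view(N, 1, -1).expand(-1, I, -1),
    # Item features are the same for every record.
    'item_obs': item_obs.view(1, I, -1).expand(N, -1, -1),
    # Look up each record's session, then repeat along the item axis.
    'session_obs': session_obs[session_index].view(N, 1, -1).expand(-1, I, -1),
    # Price features already vary by item; only the session lookup is needed.
    'price_obs': price_obs[session_index],
}
for key, val in expanded.items():
    print(key, tuple(val.shape))
```

expand does not copy data, so the repeated axis costs no extra memory until the tensor is modified or made contiguous.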
Subset method
One can use dataset[indices], with indices being an integer-valued tensor or array, to get the corresponding rows of the dataset.
The example code block below queries five randomly chosen rows of the dataset object.
The item_index, user_index, and session_index of the resulting subset will differ from those of the original dataset, but the other tensors will be the same.
# __getitem__ to get batch.
# pick 5 random purchase records as the mini-batch.
dataset = dataset.to('cpu')
indices = torch.Tensor(np.random.choice(len(dataset), size=5, replace=False)).long()
print(indices)
subset = dataset[indices]
print(dataset)
print(subset)
# print_dict_shape(subset.x_dict)
# assert torch.all(dataset.x_dict['price_obs'][indices, :, :] == subset.x_dict['price_obs'])
# assert torch.all(dataset.item_index[indices] == subset.item_index)
tensor([1118, 976, 1956, 290, 8283])
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
ChoiceDataset(label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
The subset method internally creates a copy of the dataset so that any modification applied to the subset will not be reflected in the original dataset. The researcher can feel free to modify the subset in place.
print(subset.item_index)
print(dataset.item_index[indices])
subset.item_index += 1 # modifying the batch does not change the original dataset.
print(subset.item_index)
print(dataset.item_index[indices])
tensor([0, 1, 0, 0, 0])
tensor([0, 1, 0, 0, 0])
tensor([1, 2, 1, 1, 1])
tensor([0, 1, 0, 0, 0])
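The same behaviour can be observed on a bare tensor: in PyTorch, indexing with an integer tensor returns a new tensor, whereas basic slicing returns a view of the original memory. Whether ChoiceDataset relies on exactly this mechanism or on an explicit copy is not shown in this tutorial, so treat the sketch as an analogy.

```python
import torch

item_index = torch.tensor([0, 1, 0, 0, 0, 3, 2])
indices = torch.tensor([4, 1, 6])

picked = item_index[indices]   # indexing with a tensor returns a copy
picked += 1                    # in-place change affects the copy only

sliced = item_index[:2]        # basic slicing returns a view instead
shares_memory = sliced.data_ptr() == item_index.data_ptr()

print(picked)         # tensor([1, 2, 3])
print(item_index)     # tensor([0, 1, 0, 0, 0, 3, 2])
print(shares_memory)  # True
```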
print(subset.item_obs[0, 0])
print(dataset.item_obs[0, 0])
subset.item_obs += 1
print(subset.item_obs[0, 0])
print(dataset.item_obs[0, 0])
tensor(-1.5811)
tensor(-1.5811)
tensor(-0.5811)
tensor(-1.5811)
140339656298640
140339656150528
Using PyTorch dataloader for the training loop.
The ChoiceDataset object natively supports batch samplers from PyTorch. For demonstration purposes, we turn off the shuffling option.
from torch.utils.data.sampler import BatchSampler, SequentialSampler, RandomSampler
shuffle = False # for demonstration purpose.
batch_size = 32
# Create sampler.
sampler = BatchSampler(
    RandomSampler(dataset) if shuffle else SequentialSampler(dataset),
    batch_size=batch_size,
    drop_last=False)
dataloader = torch.utils.data.DataLoader(dataset,
                                         sampler=sampler,
                                         num_workers=1,
                                         collate_fn=lambda x: x[0],
                                         pin_memory=(dataset.device == 'cpu'))
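The combination of a BatchSampler passed as sampler and collate_fn=lambda x: x[0] may look unusual. The sketch below reproduces the pattern on a toy dataset (ToyDataset is a hypothetical stand-in, not part of torch-choice) whose __getitem__ accepts a list of indices, as ChoiceDataset does. Because the sampler yields a whole list of indices at a time and the DataLoader keeps its default batch size of one, the collate function receives a one-element list holding the already-built batch, and x[0] simply unwraps it.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import BatchSampler, SequentialSampler

class ToyDataset(Dataset):
    """Toy dataset whose __getitem__ accepts a list of indices."""
    def __init__(self, n):
        self.values = torch.arange(n)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, indices):
        return self.values[indices]

toy = ToyDataset(10)
toy_sampler = BatchSampler(SequentialSampler(toy), batch_size=4, drop_last=False)
# Each "sample" is already a full batch, so collate_fn only unwraps the list.
loader = DataLoader(toy, sampler=toy_sampler, collate_fn=lambda x: x[0])

batches = [batch.tolist() for batch in loader]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With drop_last=False the final, shorter batch is kept, which is why the last batch has only two elements.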
print(f'{item_obs.shape=:}')
item_obs_all = item_obs.view(1, num_items, -1).expand(len(dataset), -1, -1)
item_obs_all = item_obs_all.to(dataset.device)
item_index_all = item_index.to(dataset.device)
print(f'{item_obs_all.shape=:}')
item_obs.shape=torch.Size([4, 64])
item_obs_all.shape=torch.Size([10000, 4, 64])
for i, batch in enumerate(dataloader):
    first, last = i * batch_size, min(len(dataset), (i + 1) * batch_size)
    idx = torch.arange(first, last)
    assert torch.all(item_obs_all[idx, :, :] == batch.x_dict['item_obs'])
    assert torch.all(item_index_all[idx] == batch.item_index)
torch.Size([16, 4, 64])
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.price_obs.shape=torch.Size([10000, 4, 12])
10000
Chaining Multiple Datasets: JointDataset
Examples
dataset1 = dataset.clone()
dataset2 = dataset.clone()
joint_dataset = JointDataset(the_dataset=dataset1, another_dataset=dataset2)
JointDataset with 2 sub-datasets: (
the_dataset: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
another_dataset: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
)