2 changes: 2 additions & 0 deletions docs/source/Examples.md
@@ -198,6 +198,8 @@ There are 2 additional steps to use DeepCTR with sequence feature input.
1. Generate the padded and encoded sequence feature for the sequence input (**value 0 is for padding**).
2. Generate config of sequence feature with [VarLenSparseFeat](./Features.html#varlensparsefeat)

For more details about padding, `maxlen`, embedding sharing, DIN history fields, DSIN session fields, and TFRecord inputs, see the <a href="Sequence-Cookbook.html">Sequence Feature Cookbook</a>.

This example shows how to use ``DeepFM`` with a sequence (multi-value) feature. You can get the demo data
[movielens_sample.txt](https://github.com/shenweichen/DeepCTR/tree/master/examples/movielens_sample.txt) and run the
following code.
2 changes: 1 addition & 1 deletion docs/source/Quick-Start.md
@@ -84,7 +84,7 @@ data[dense_features] = mms.fit_transform(data[dense_features])
For sparse features, we transform them into dense vectors by embedding techniques.
For dense numerical features, we concatenate them to the input tensors of fully connected layer.

And for varlen(multi-valued) sparse features,you can use [VarlenSparseFeat](./Features.html#varlensparsefeat). Visit [examples](./Examples.html#multi-value-input-movielens) of using `VarlenSparseFeat`
And for varlen (multi-valued) sparse features, you can use [VarLenSparseFeat](./Features.html#varlensparsefeat). See the [examples](./Examples.html#multi-value-input-movielens) of using `VarLenSparseFeat` and the <a href="Sequence-Cookbook.html">Sequence Feature Cookbook</a> for sequence model input conventions.

- Label Encoding
354 changes: 354 additions & 0 deletions docs/source/Sequence-Cookbook.md
@@ -0,0 +1,354 @@
# Sequence Feature Cookbook

This cookbook collects the input conventions for multi-value features and sequence models such as DIN, BST, DIEN, and DSIN.

## One Row Means One Prediction Target

DeepCTR models expect each training sample to be one row. The row contains:

- the candidate feature to score, such as `item_id` or `cate_id`
- user/context features, such as `user`, `gender`, or `pay_score`
- optional history or session features that describe the user's past behavior
- one label for this candidate row

For example, if the same user has two candidate interactions, build two rows. The history fields can be repeated or updated per row.

```text
user  item_id  cate_id  hist_item_id  hist_cate_id  label
0     1        1        [1, 2, 3, 0]  [1, 2, 2, 0]  1
0     2        2        [1, 2, 3, 1]  [1, 2, 2, 1]  0
```

History fields are context for the row. The label belongs to the candidate row, not to every element inside the history.
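The row layout above can be sketched in plain Python. This is only an illustration: the `build_rows` helper and the `(item_id, cate_id, label)` event layout are assumptions made for this example, and treating every earlier event as history is one possible choice, not a DeepCTR requirement.

```python
def build_rows(user, events, maxlen):
    """Turn one user's time-ordered events into one row per candidate.

    `events` is a hypothetical list of (item_id, cate_id, label) tuples,
    oldest first. Every earlier event is treated as history here.
    """
    rows = []
    for i, (item_id, cate_id, label) in enumerate(events):
        # Keep at most `maxlen` events that happened before this candidate.
        past = events[max(0, i - maxlen):i]
        pad = [0] * (maxlen - len(past))  # 0 is the padding id
        rows.append({
            "user": user,
            "item_id": item_id,
            "cate_id": cate_id,
            "hist_item_id": [e[0] for e in past] + pad,
            "hist_cate_id": [e[1] for e in past] + pad,
            "seq_length": len(past),
            "label": label,  # one label per candidate row
        })
    return rows


rows = build_rows(user=0, events=[(1, 1, 1), (2, 2, 0), (3, 2, 1)], maxlen=4)
```

Each dictionary is one training sample. The first row has an empty history, so you may prefer to drop such rows or keep them, depending on your task.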

## Feature Column Shapes

Use `SparseFeat` for one categorical id per row.

```python
from deepctr.feature_column import DenseFeat, SparseFeat, VarLenSparseFeat
SparseFeat("item_id", vocabulary_size=item_count + 1, embedding_dim=8)
```

Its input shape is usually `(batch_size,)` or `(batch_size, 1)`.

Use `DenseFeat` for numerical values or dense vectors.

```python
DenseFeat("pay_score", 1)
DenseFeat("article_vector", 128)
```

Use `VarLenSparseFeat` for a list of categorical ids per row.

```python
VarLenSparseFeat(
    SparseFeat("genres", vocabulary_size=genre_count + 1, embedding_dim=4),
    maxlen=max_genre_len,
    combiner="mean",
)
```

Its input shape is `(batch_size, maxlen)`. Values must be padded to the same `maxlen` before being passed to the model.

## Padding and Length

For `VarLenSparseFeat`, value `0` is the default padding value. Do not use `0` as a valid category id when `length_name` is not set.

```python
import numpy as np

# Two rows padded with trailing zeros to maxlen=4.
genres = np.array([
    [1, 3, 0, 0],
    [2, 5, 8, 0],
])
```

If you pass `length_name`, DeepCTR uses that length feature to build the pooling mask.

```python
VarLenSparseFeat(
    SparseFeat("hist_item_id", vocabulary_size=item_count + 1, embedding_dim=8),
    maxlen=4,
    length_name="seq_length",
)
```

Then add `seq_length` to the model input:

```python
model_input = {
    "hist_item_id": hist_item_id,
    "seq_length": np.array([3, 2, 4]),
}
```

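Padding is still required when `length_name` is set, because every row in a batch must have the same shape `(batch_size, maxlen)`. The length feature only tells the pooling mask which positions are valid.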
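A minimal sketch of producing both the padded ids and the matching lengths, using only the standard library. The `pad_ids` helper is hypothetical, and keeping the most recent ids when a sequence is too long is an assumption that suits history fields; for unordered fields such as genres any truncation rule works.

```python
def pad_ids(sequences, maxlen, pad_value=0):
    """Truncate each id list to `maxlen` and pad it with trailing zeros.

    Returns (padded, lengths) where lengths[i] is the number of valid ids
    in row i, suitable for a `length_name` feature.
    """
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded, lengths = [], []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep the most recent `maxlen` ids
        lengths.append(len(kept))
        padded.append(kept + [pad_value] * (maxlen - len(kept)))
    return padded, lengths


padded, lengths = pad_ids([[1, 3], [2, 5, 8], [7, 6, 5, 4, 9]], maxlen=4)
```

Wrap `padded` and `lengths` in `np.array(...)` before placing them in the model input dictionary.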

## Multi-Value Feature or Multi-Hot Vector

`VarLenSparseFeat` stores categorical ids and then applies embedding lookup plus pooling. `maxlen` is the maximum number of values in one sample. It is not the vocabulary size.

For a `genres` feature with 18 possible genres and at most 5 genres per movie:

```python
VarLenSparseFeat(
    SparseFeat("genres", vocabulary_size=18 + 1, embedding_dim=4),
    maxlen=5,
    combiner="mean",
)
```

A multi-hot vector is also possible, but then it is a dense vector with length equal to the vocabulary size:

```python
DenseFeat("genres_multihot", 18)
```

The two representations are different. `VarLenSparseFeat` learns an embedding for every genre id and pools only the present genres. A multi-hot vector feeds the raw indicator vector into the dense part of the model. For large vocabularies, `VarLenSparseFeat` is usually more memory efficient and easier to share with other categorical fields.
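To make the difference in shape concrete, the sketch below encodes the same movie both ways. Both helpers are hypothetical and assume genre ids run from `1` to `18`, with `0` reserved for padding.

```python
def to_padded_ids(genre_ids, maxlen):
    """Id-list form for VarLenSparseFeat: length is `maxlen`."""
    kept = list(genre_ids)[:maxlen]
    return kept + [0] * (maxlen - len(kept))


def to_multi_hot(genre_ids, vocab_size):
    """Indicator form for DenseFeat: length is the vocabulary size.

    Assumes ids are in 1..vocab_size, so id g maps to index g - 1.
    """
    vec = [0.0] * vocab_size
    for g in genre_ids:
        vec[g - 1] = 1.0
    return vec


movie_genres = [2, 5, 17]
ids = to_padded_ids(movie_genres, maxlen=5)      # 5 values per row
hot = to_multi_hot(movie_genres, vocab_size=18)  # 18 values per row
```

The id list grows with `maxlen`, while the multi-hot vector grows with the vocabulary, which is why the id form scales better for large vocabularies.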

## Multiple VarLenSparseFeat Fields

Add multiple variable-length fields to the same feature column list. Each field can have its own vocabulary and `maxlen`.

```python
feature_columns = [
    SparseFeat("user", user_count + 1, embedding_dim=8),
    SparseFeat("item_id", item_count + 1, embedding_dim=8),
    VarLenSparseFeat(
        SparseFeat("genres", genre_count + 1, embedding_dim=4),
        maxlen=max_genre_len,
        combiner="mean",
    ),
    VarLenSparseFeat(
        SparseFeat("tags", tag_count + 1, embedding_dim=4),
        maxlen=max_tag_len,
        combiner="mean",
    ),
]
```

The model input must include one padded array for each field:

```python
model_input = {
    "user": user,
    "item_id": item_id,
    "genres": genres,
    "tags": tags,
}
```

## Sharing Embeddings

Use the same `embedding_name` when two features represent ids from the same dictionary and should share one embedding table.

```python
feature_columns = [
    SparseFeat("item_id", item_count + 1, embedding_dim=8),
    VarLenSparseFeat(
        SparseFeat(
            "hist_item_id",
            item_count + 1,
            embedding_dim=8,
            embedding_name="item_id",
        ),
        maxlen=4,
    ),
]
```

Use different `embedding_name` values when the fields are semantically different, even if they have the same value range.

## DIN and BST History Feature Names

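DIN and BST take a `history_feature_list` argument: the names of the candidate features whose history sequences should be matched against the current candidate. The snippets below keep that list in a variable named `behavior_feature_list`.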

```python
behavior_feature_list = ["item_id", "cate_id"]
```

For every name in `behavior_feature_list`, the history sequence feature must be named with the `hist_` prefix:

```python
SparseFeat("item_id", item_count + 1, embedding_dim=8)
SparseFeat("cate_id", cate_count + 1, embedding_dim=4)

VarLenSparseFeat(
    SparseFeat("hist_item_id", item_count + 1, embedding_dim=8, embedding_name="item_id"),
    maxlen=4,
    length_name="seq_length",
)
VarLenSparseFeat(
    SparseFeat("hist_cate_id", cate_count + 1, embedding_dim=4, embedding_name="cate_id"),
    maxlen=4,
    length_name="seq_length",
)
```

If `behavior_feature_list = ["item"]`, the expected history feature name is `hist_item`. If `behavior_feature_list = ["item_id"]`, the expected history feature name is `hist_item_id`.
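A small pre-flight check can catch naming mistakes before the model is built. The two helpers below are hypothetical and only encode the `hist_` prefix rule described above; they do not call DeepCTR.

```python
def expected_history_names(behavior_feature_list, prefix="hist_"):
    """Names the history sequence fields are expected to have."""
    return [prefix + name for name in behavior_feature_list]


def missing_history_fields(behavior_feature_list, varlen_names):
    """Expected history names that are absent from the declared fields."""
    declared = set(varlen_names)
    return [
        name
        for name in expected_history_names(behavior_feature_list)
        if name not in declared
    ]


behavior_feature_list = ["item_id", "cate_id"]
declared_varlen = ["hist_item_id", "hist_cate", "genres"]  # note the typo
missing = missing_history_fields(behavior_feature_list, declared_varlen)
```

Here `missing` reports `hist_cate_id`, because the declared field was named `hist_cate` instead.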

Inside DIN:

- `varlen_sparse_feature_columns` means all `VarLenSparseFeat` fields passed to the model.
- `history_feature_columns` means the `VarLenSparseFeat` fields whose names match `hist_` + a behavior feature name. These fields are used as keys in attention.
- `sparse_varlen_feature_columns` means other `VarLenSparseFeat` fields. They are pooled and appended to the DNN input, but they are not used as the DIN attention history.

For example, `genres` is a non-history multi-value field:

```python
VarLenSparseFeat(
    SparseFeat("genres", genre_count + 1, embedding_dim=4),
    maxlen=max_genre_len,
    combiner="mean",
)
```

It belongs to `sparse_varlen_feature_columns`, not to `history_feature_columns`.
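The grouping described above can be mimicked by name alone. This sketch follows the description in this section rather than the library source, and the `split_varlen_fields` helper is hypothetical.

```python
def split_varlen_fields(varlen_names, behavior_feature_list):
    """Split variable-length field names into attention history and the rest."""
    history_names = {"hist_" + name for name in behavior_feature_list}
    history = [n for n in varlen_names if n in history_names]
    other = [n for n in varlen_names if n not in history_names]
    return history, other


history, other = split_varlen_fields(
    ["hist_item_id", "hist_cate_id", "genres"],
    behavior_feature_list=["item_id", "cate_id"],
)
```

In this example `history` holds the two `hist_` fields that feed attention, and `other` holds `genres`, which is pooled and appended to the DNN input.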

## DIN Does Not Support VarLen of VarLen

DIN expects the candidate behavior feature to be one sparse id per row, and the history behavior feature to be one padded sequence per row.

Supported:

```text
item_id: 3
hist_item_id: [1, 2, 4, 0]
```

Not supported directly:

```text
item_categories: [1, 2, 0]
hist_item_categories: [[1, 2, 0], [3, 4, 0], [5, 0, 0]]
```

The second case is a sequence of multi-value behaviors, which is a 3D tensor of shape `(batch_size, maxlen, values_per_behavior)`. DeepCTR's current `DIN` implementation does not support nesting one `VarLenSparseFeat` inside another.

Common workarounds are:

- choose one representative category for the candidate and each history item
- map a category set to one categorical id before training
- pre-pool the category set outside the model and build a custom DIN variant
- customize the model to handle a 3D behavior tensor
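As an illustration of the second workaround, each category set can be mapped to a single id before training. The helpers below are hypothetical; they ignore padding zeros and element order, and they reserve id `0` for padding. Sets that were not seen when the vocabulary was built need a fallback that this sketch does not cover.

```python
def set_key(categories):
    """Order-independent key for a category set, ignoring padding zeros."""
    return frozenset(c for c in categories if c != 0)


def build_set_vocab(category_sets):
    """Assign ids 1, 2, ... to distinct category sets; 0 stays for padding."""
    vocab = {}
    for cats in category_sets:
        key = set_key(cats)
        if key not in vocab:
            vocab[key] = len(vocab) + 1
    return vocab


hist_item_categories = [[1, 2, 0], [3, 4, 0], [5, 0, 0]]
item_categories = [2, 1, 0]

vocab = build_set_vocab(hist_item_categories + [item_categories])
hist_set_id = [vocab[set_key(c)] for c in hist_item_categories]
item_set_id = vocab[set_key(item_categories)]
```

After this step the candidate is one id and the history is one padded id sequence, which matches the supported layout. Use `len(vocab) + 1` as the `vocabulary_size`.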

## Dense History Features in DIN

Current DIN attention is built from sparse embedding sequences. Dense features such as `pay_score` are appended to the DNN input, but they are not part of the attention keys by default.

If you need dense history values inside attention, common choices are:

- discretize the dense value into buckets and use it as a sparse sequence feature
- concatenate or combine dense history representations before passing them to a custom attention layer
- build a custom model based on `deepctr.models.sequence.din.DIN`

DeepCTR does not provide a `VarLenDenseFeat` class.
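The first option, bucketing, can be sketched with the standard library. The boundaries and the `bucketize` helper are illustrative assumptions; bucket ids start at `1` so that `0` remains the padding id.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Map a dense value to a bucket id in 1..len(boundaries) + 1."""
    return bisect_right(boundaries, value) + 1


boundaries = [0.2, 0.5, 0.8]          # hypothetical cut points for pay_score
hist_pay_score = [0.1, 0.5, 0.93]     # one user's dense history values

hist_pay_bucket = [bucketize(v, boundaries) for v in hist_pay_score]
vocabulary_size = len(boundaries) + 2  # bucket ids plus the padding id 0
```

The resulting ids can then be padded and declared as a sparse sequence field like any other history feature.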

## DSIN Input Format

DeepCTR's DSIN implementation expects sessions to be prepared before training. It does not split a raw event stream into sessions inside the model.

For `behavior_feature_list = ["item", "cate_id"]` and `sess_max_count=2`, prepare fields like:

```python
feature_columns = [
    SparseFeat("item", item_count + 1, embedding_dim=4),
    SparseFeat("cate_id", cate_count + 1, embedding_dim=4),
    VarLenSparseFeat(
        SparseFeat("sess_0_item", item_count + 1, embedding_dim=4, embedding_name="item"),
        maxlen=4,
    ),
    VarLenSparseFeat(
        SparseFeat("sess_0_cate_id", cate_count + 1, embedding_dim=4, embedding_name="cate_id"),
        maxlen=4,
    ),
    VarLenSparseFeat(
        SparseFeat("sess_1_item", item_count + 1, embedding_dim=4, embedding_name="item"),
        maxlen=4,
    ),
    VarLenSparseFeat(
        SparseFeat("sess_1_cate_id", cate_count + 1, embedding_dim=4, embedding_name="cate_id"),
        maxlen=4,
    ),
]
```

The model input should contain one padded sequence per session field and `sess_length`, the number of valid sessions for each row:

```python
model_input = {
    "item": item,
    "cate_id": cate_id,
    "sess_0_item": sess_0_item,
    "sess_0_cate_id": sess_0_cate_id,
    "sess_1_item": sess_1_item,
    "sess_1_cate_id": sess_1_cate_id,
    "sess_length": np.array([2, 1, 0]),
}
```

Usually you keep only the most recent `sess_max_count` sessions and the most recent `maxlen` events in each session, then pad the rest with `0`.
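The truncation and padding rule above can be sketched for a single row as follows. The `build_session_fields` helper is hypothetical. It assumes sessions arrive oldest first and places the kept sessions in `sess_0`, `sess_1`, and so on, so that the valid sessions always occupy the first `sess_length` slots; which session should be `sess_0` is a choice this sketch makes, not something stated above.

```python
def build_session_fields(sessions, name, sess_max_count, maxlen):
    """Build the sess_i_<name> fields and sess_length for one row.

    `sessions` is a list of id lists, oldest session first.
    """
    if sess_max_count <= 0 or maxlen <= 0:
        raise ValueError("sess_max_count and maxlen must be positive")
    recent = sessions[-sess_max_count:]  # most recent sessions only
    fields = {}
    for i in range(sess_max_count):
        # Most recent `maxlen` events of the session, or empty if absent.
        events = list(recent[i][-maxlen:]) if i < len(recent) else []
        fields[f"sess_{i}_{name}"] = events + [0] * (maxlen - len(events))
    fields["sess_length"] = len(recent)
    return fields


row = build_session_fields(
    sessions=[[9], [4, 5], [2, 3, 6, 7, 8]],
    name="item",
    sess_max_count=2,
    maxlen=4,
)
```

Stack the per-row lists into arrays of shape `(batch_size, maxlen)` before passing them to the model, and repeat the call for each behavior feature such as `cate_id`.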

## DSIN Labels and Sessions

Each row still has one label for the candidate item in that row.

```text
user  item  cate_id  sess_0_item   sess_1_item   sess_length  label
0     1     1        [4, 5, 0, 0]  [2, 3, 0, 0]  2            1
0     2     2        [4, 5, 1, 0]  [2, 3, 0, 0]  2            0
```

The same user can appear in multiple rows. Sessions are historical context for the candidate row. They do not have their own labels inside the DSIN input.

If your raw data contains timestamps or time deltas, use them during preprocessing to split events into sessions. If you want time gaps as model features, bucket them into categorical ids or build a custom model. DeepCTR does not provide `VarLenDenseFeat` for raw variable-length dense time values.
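One common way to do that preprocessing is to start a new session whenever the gap between consecutive events exceeds a threshold. The `split_sessions` helper and the 1800-second threshold below are illustrative assumptions, not values taken from DeepCTR.

```python
def split_sessions(events, max_gap):
    """Group time-ordered (timestamp, item_id) events into sessions.

    A new session starts when the gap to the previous event is larger
    than `max_gap` (same unit as the timestamps).
    """
    sessions = []
    last_ts = None
    for ts, item_id in events:
        if last_ts is None or ts - last_ts > max_gap:
            sessions.append([])  # open a new session
        sessions[-1].append(item_id)
        last_ts = ts
    return sessions


events = [(0, 4), (60, 5), (4000, 2), (4100, 3)]  # timestamps in seconds
sessions = split_sessions(events, max_gap=1800)
```

The resulting list of sessions, oldest first, is then truncated and padded into the `sess_0_*`, `sess_1_*` fields.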

## Estimator and TFRecord Inputs

The Keras-style DeepCTR models use `SparseFeat`, `DenseFeat`, and `VarLenSparseFeat`. The Estimator models use TensorFlow `tf.feature_column` objects directly.

For TFRecord vector features, make the dtype and feature column type match:

- categorical id fields should be integer features and use `categorical_column_with_identity`, `categorical_column_with_hash_bucket`, or another categorical column
- dense float vectors should use `numeric_column(..., shape=(dim,))`, not `embedding_column`

Example:

```python
import tensorflow as tf

feature_description = {
    "article_id": tf.io.FixedLenFeature(shape=(1,), dtype=tf.int64),
    "article_vector": tf.io.FixedLenFeature(shape=(128,), dtype=tf.float32),
    "clicked": tf.io.FixedLenFeature(shape=(1,), dtype=tf.float32),
}

dnn_feature_columns = [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_identity("article_id", num_buckets=100000),
        dimension=8,
    ),
    tf.feature_column.numeric_column("article_vector", shape=(128,)),
]
linear_feature_columns = [
    tf.feature_column.categorical_column_with_identity("article_id", num_buckets=100000),
    tf.feature_column.numeric_column("article_vector", shape=(128,)),
]
```

For padded integer sequence fields in TFRecord, store the sequence as a fixed-length integer feature:

```python
feature_description = {
    "hist_item_id": tf.io.FixedLenFeature(shape=(maxlen,), dtype=tf.int64),
}
```

Then use TensorFlow's categorical feature columns according to your Estimator model. The `VarLenSparseFeat(maxlen=...)` argument is part of the Keras-style API and is not used by `tf.feature_column`.

## Quick Checklist

- Use `0` for padding and start valid categorical ids from `1` for sequence fields.
- Set `vocabulary_size` to at least `max_id + 1`.
- Pad every `VarLenSparseFeat` input to `(batch_size, maxlen)`.
- Use `length_name` when you want explicit sequence lengths.
- Use `embedding_name` to share embeddings between candidate and history fields.
- For DIN/BST, history names must be `hist_` + the names in `behavior_feature_list`.
- For DSIN, split sessions offline and pass `sess_0_*`, `sess_1_*`, and `sess_length`.
- Use `DenseFeat(name, dimension)` or `numeric_column(name, shape=(dimension,))` for dense vectors.
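Several of these checks can be automated before training. The sketch below is a hypothetical validator for one padded sequence field. It checks the row length, the id range, and that padding zeros only appear at the end; the last check assumes post-padding, which matters when a length feature is used to build the mask.

```python
def check_padded_ids(rows, maxlen, vocabulary_size):
    """Return a list of human-readable problems found in padded id rows."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != maxlen:
            problems.append(f"row {i}: length {len(row)} != maxlen {maxlen}")
        if any(v < 0 or v >= vocabulary_size for v in row):
            problems.append(f"row {i}: id outside [0, {vocabulary_size})")
        valid = [v for v in row if v != 0]
        if list(row[:len(valid)]) != valid:
            problems.append(f"row {i}: padding 0 appears before a valid id")
    return problems


ok = check_padded_ids([[1, 3, 0, 0], [2, 5, 8, 0]], maxlen=4, vocabulary_size=9)
bad = check_padded_ids([[1, 0, 2, 0], [1, 2, 3]], maxlen=4, vocabulary_size=3)
```

An empty result means the field passed all three checks; otherwise each entry names the offending row.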
6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -69,6 +69,12 @@ DisscussionGroup
   FAQ<FAQ.md>
   History<History.md>

.. toctree::
   :maxdepth: 2
   :caption: Cookbook:

   Sequence Feature Cookbook<Sequence-Cookbook.md>

.. toctree::
   :maxdepth: 3
   :caption: API: