These three example traces show the implemented ReRAM model in the simulator, which captures the measured conductance response curve quite well.
One can also see the device-to-device variability, as illustrated by the three differently colored traces showing three different device updates.

We have implemented several different ways to perform the update in Analog and hope to extend the number of available optimizers in the future:

* Plain SGD: Fully parallel update using stochastic pulse trains by Gokmen & Vlasov :ref:`[9] <references>`.
* Mixed precision: Digital rank update and transfer by Nandakumar et al. :ref:`[4] <references>`.
* Tiki-taka (TTv1): Momentum-like SGD update by Gokmen & Haensch :ref:`[10] <references>`.
* TTv2: Buffered transfer with a floating-point H buffer by Gokmen :ref:`[16] <references>`.
* TTv3 (c-TTv2): Chopped-TTv2, buffered transfer with input/output choppers by Rasch et al. :ref:`[17] <references>`.
* TTv4 (AGAD): Analog Gradient Accumulation with Dynamic reference by Rasch et al. :ref:`[17] <references>`.

These algorithmic improvements and the adaptation of existing algorithms to the characteristics of Analog hardware are among the key focus areas of this toolkit.

The plain SGD optimizer implements a fast way to do the gradient update fully in Analog, using coincidences of stochastic pulse trains to compute
the outer product, as suggested by Gokmen & Vlasov :ref:`[9] <references>`. The mixed precision optimizer was proposed by Nandakumar
et al. in 2020 :ref:`[4] <references>`. In this optimizer, the outer product that forms the weight gradients is computed in digital. Compared to the
plain SGD optimizer, which performs its update fully in parallel, this scheme requires more digital compute units on the chip, but it is a good
choice for much more non-ideal devices. The Tiki-taka optimizer (TTv1) implements an algorithm that is similar to momentum SGD and assumes that
both the momentum term and the weight matrix are on analog crossbar arrays, as discussed in :ref:`[10] <references>`. TTv2 adds a floating-point
H buffer between the fast and slow arrays :ref:`[16] <references>`, enabling lossless accumulation of fractional gradient steps. TTv3 (c-TTv2)
further introduces input/output choppers that suppress systematic bias :ref:`[17] <references>`, and TTv4 (AGAD) extends TTv3 with a statistical
approach for computing the gradient update :ref:`[17] <references>`.

Plain SGD: Fully Parallel Update
---------------------------------
See `example 12 <https://github.com/IBM/aihwkit/blob/master/examples/12_simple_l
for an illustration of how to use the mixed precision update in the aihwkit :ref:`[4] <references>`.
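
A minimal sketch of such a configuration is shown below, assuming the
:class:`~aihwkit.simulator.configs.compounds.MixedPrecisionCompound` interface; the
device choice and layer sizes are illustrative only.

.. code-block:: python

    from aihwkit.nn import AnalogLinear
    from aihwkit.simulator.configs import DigitalRankUpdateRPUConfig
    from aihwkit.simulator.configs.compounds import MixedPrecisionCompound
    from aihwkit.simulator.configs.devices import ConstantStepDevice

    # The outer product (rank update) is computed in digital; the analog
    # device only receives the accumulated updates.
    rpu_config = DigitalRankUpdateRPUConfig(
        device=MixedPrecisionCompound(device=ConstantStepDevice())
    )
    layer = AnalogLinear(4, 2, rpu_config=rpu_config)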


Tiki-taka (TTv1): Momentum-like SGD Update
------------------------------------------
The Tiki-Taka optimizer is also algorithmically similar to momentum SGD. The difference here is that the momentum matrix is also in Analog.
This implies that the outer product update onto the momentum matrix is done in analog in fully parallel mode using the stochastic pulse trains
described earlier. Therefore, this optimizer does not have the potential bottleneck of computing the outer product in digital as in the
mixed precision optimizer. This is explained in more detail in :ref:`[10] <references>`.

.. image:: ../img/tikitaka.png
:alt: Tiki-taka: Momentum-like SGD Update

**TTv1 Formulation**

The core update equations for Tiki-taka (TTv1) are:

.. math::

   A \mathrel{-}= \beta \cdot \text{Gradient}

.. math::

   C \mathrel{+}= \alpha \cdot A

Where:

* :math:`A` is the fast (momentum) array, updated at every gradient step with learning rate :math:`\beta`
* :math:`C` is the slow (weight) array, updated periodically via transfer events with coefficient :math:`\alpha`
* The gradient is computed on :math:`\gamma \cdot A + C`, where :math:`\gamma` controls the contribution of A to the effective weight.

The key distinguishing feature is that momentum decay is achieved implicitly through device asymmetry (random up/down pulses on :math:`A`) rather than explicit multiplicative decay, which is difficult to implement in analog hardware.
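
A minimal sketch of how TTv1 can be configured in aihwkit is shown below. This assumes the
:class:`~aihwkit.simulator.configs.compounds.TransferCompound` interface with ``gamma`` and
``transfer_every`` fields; the device choice and all values are illustrative only, not a
recommended setting.

.. code-block:: python

    from aihwkit.simulator.configs import UnitCellRPUConfig
    from aihwkit.simulator.configs.compounds import TransferCompound
    from aihwkit.simulator.configs.devices import SoftBoundsDevice

    # Two analog arrays: the first entry acts as the fast A array,
    # the second as the slow C array.
    rpu_config = UnitCellRPUConfig(
        device=TransferCompound(
            unit_cell_devices=[SoftBoundsDevice(), SoftBoundsDevice()],
            transfer_every=1,  # transfer one column per mini-batch (illustrative)
            gamma=0.0,         # A fully hidden, as in plain TTv1
        )
    )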

**Residual Learning and Bit-Slicing with Non-Zero** :math:`\gamma`

The ``gamma`` parameter enables two complementary mechanisms in TTv1-TTv3:

.. math::

W_{\text{eff}} = \gamma \cdot A + C

The gradient is evaluated at the effective weight :math:`\gamma A + C` rather than at C alone,
so A can directly influence the gradient direction and magnitude.
The relative contribution of A is controlled by ``gamma``:

**When** ``gamma = 0`` **(default):** A is fully hidden — gradients are
evaluated only at C. A acts as a hidden momentum buffer whose content is
periodically transferred to C. Because transfers are discrete and
infrequent, C may lag the true gradient direction, introducing gradient
staleness.

**When** ``gamma > 0`` **:** A becomes an active *residual branch* on top of C,
enabling two complementary mechanisms:

1. **Residual learning:** A can now track the residual of C: after each
transfer, any remaining deviation of C from the ideal weight (due to
device non-linearity, write noise, saturation, or drift) is visible in the
gradient evaluated at :math:`\gamma A + C`. This gradient drives A in the
direction that corrects C's error, so A continuously compensates for
whatever C fails to represent. When the next transfer event fires, the
correction accumulated in A is pushed into C, pulling it closer to the
   ideal weight. The mechanism is analyzed in detail by Wu et al. :ref:`[18] <references>`.

2. **Bit-slicing (precision enhancement):** The two-layer decomposition
:math:`W = \gamma A + C` acts as a *bit-slicing* mechanism: the fast array A
can represent finer-grained weight updates (higher effective precision) while
the slow array C provides stable storage of the coarse weight values. By
tuning ``gamma`` and the transfer frequency, the effective weight granularity
can be reduced below the device's native conductance step, enabling higher
training accuracy without modifying the underlying analog device. This
approach is particularly valuable when C's device granularity is coarse or
   non-uniform. See Li et al. :ref:`[19] <references>` for its extension to the multi-array
   setting as well as a detailed analysis. A short illustrative calculation of this
   granularity gain is given below.
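
As an illustrative calculation (the numbers are assumptions, not device measurements): if the
slow array C has a native conductance step of :math:`\Delta w_C`, the fast array A has a
comparable step :math:`\Delta w_A \approx \Delta w_C`, and we choose :math:`\gamma = 1/16`,
then a single minimal pulse on A changes the effective weight by only

.. math::

   \Delta W_{\text{eff}} = \gamma \cdot \Delta w_A \approx \frac{\Delta w_C}{16},

so the effective weight granularity is sixteen times finer than the native device step of C.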

TTv2: Buffered Transfer
-----------------------
The buffered transfer algorithm (TTv2), proposed by Gokmen :ref:`[16] <references>`, extends
Tiki-taka by introducing a floating-point H buffer between the fast analog array A and the
slow weight array C. Instead of sending stochastic update pulses to C at every gradient step,
each transfer event first reads a column of A and accumulates the result in the digital buffer H:

.. math::

H \mathrel{+}= \alpha \cdot A

where :math:`\alpha` is a learning-rate scale factor. An integer number of pulses is sent to C
only when the accumulated value exceeds the threshold :math:`|H| \geq 1`, after which H is
reduced by the number of steps taken (or decayed by a momentum factor when ``forget_buffer=True``).
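
A per-element sketch of this threshold logic in plain Python (not the aihwkit internals;
``dw_min``, denoting the minimal conductance step of C, is an assumed name):

.. code-block:: python

    import math

    def buffered_transfer(h: float, c: float, a_read: float,
                          alpha: float, dw_min: float) -> tuple:
        """One buffered transfer event for a single weight element (sketch)."""
        h += alpha * a_read        # accumulate the read of A into the H buffer
        n_steps = math.trunc(h)    # integer number of pulses once |H| >= 1
        if n_steps != 0:
            c += n_steps * dw_min  # dispatch integer pulses to C
            h -= n_steps           # reduce H by the number of steps taken
        return h, c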

This buffered scheme provides two key advantages over plain Tiki-taka (TTv1):

* **Reduced write noise on C** — pulses are sent to the slow device only when the buffer
is large enough to justify a full integer step, so C is updated less frequently and with
steps that match its conductance granularity.
* **Lossless accumulation** — fractional gradient contributions that would otherwise be
rounded away by the finite granularity of C are preserved in the floating-point buffer
until they can be committed.

The algorithm is configured via
:class:`~aihwkit.simulator.configs.compounds.BufferedTransferCompound`.
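
A minimal configuration sketch, following the same compound-device pattern as for TTv1
(the device choice and values are illustrative; ``forget_buffer`` is the option mentioned
above):

.. code-block:: python

    from aihwkit.simulator.configs import UnitCellRPUConfig
    from aihwkit.simulator.configs.compounds import BufferedTransferCompound
    from aihwkit.simulator.configs.devices import SoftBoundsDevice

    rpu_config = UnitCellRPUConfig(
        device=BufferedTransferCompound(
            unit_cell_devices=[SoftBoundsDevice(), SoftBoundsDevice()],
            transfer_every=1,     # read one column of A per mini-batch
            forget_buffer=False,  # reduce H by the steps taken, do not decay it
        )
    )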

TTv3 (c-TTv2): Chopped Buffered Transfer
-----------------------------------------
TTv3, originally named **Chopped-TTv2 (c-TTv2)** by Rasch et al. :ref:`[17] <references>`, extends TTv2 by adding *choppers* —
random binary sign-flip patterns applied to the input (and optionally output) of each transfer
read. After each column read of A, the row chopper sign is randomly toggled with probability
``in_chop_prob``; the current chopper state multiplies both the update written to A and the
value accumulated in H:

.. math::

H \mathrel{+}= \text{chopper} \cdot \alpha \cdot A

Because the chopper sign is applied consistently to both the write and the read, the effective
gradient in H is unbiased. Systematic device offsets and long-range correlations on the fast
array A average out over successive chopper flips, enabling more aggressive transfer rates
without accumulating systematic errors on C.
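
A sketch of the chopper bookkeeping on the accumulation side in plain Python (class and
variable names are assumed for illustration; the matching sign flip applied to the writes
onto A is omitted):

.. code-block:: python

    import numpy as np

    class Chopper:
        """Input chopper state for one transfer column (illustrative sketch)."""

        def __init__(self, in_chop_prob: float = 0.1, seed: int = 42):
            self.sign = 1.0
            self.in_chop_prob = in_chop_prob
            self.rng = np.random.default_rng(seed)

        def accumulate(self, h: np.ndarray, a_column: np.ndarray,
                       alpha: float) -> np.ndarray:
            h = h + self.sign * alpha * a_column  # chopper-modulated read into H
            if self.rng.random() < self.in_chop_prob:
                self.sign = -self.sign            # random sign toggle after the read
            return h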

The standard TTv3 transfer logic — accumulation, threshold crossing, and pulse dispatch to C —
is identical to TTv2. The sole difference is that all reads of A are chopper-modulated. Both
input choppers (``in_chop_prob``) and output choppers (``out_chop_prob``) can be configured
independently.

The algorithm is configured via
:class:`~aihwkit.simulator.configs.compounds.ChoppedTransferCompound`.
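
A minimal configuration sketch (device choice and probabilities are illustrative;
``in_chop_prob`` and ``out_chop_prob`` are the options named above):

.. code-block:: python

    from aihwkit.simulator.configs import UnitCellRPUConfig
    from aihwkit.simulator.configs.compounds import ChoppedTransferCompound
    from aihwkit.simulator.configs.devices import SoftBoundsDevice

    rpu_config = UnitCellRPUConfig(
        device=ChoppedTransferCompound(
            unit_cell_devices=[SoftBoundsDevice(), SoftBoundsDevice()],
            in_chop_prob=0.1,   # input chopper toggle probability
            out_chop_prob=0.0,  # output choppers disabled in this sketch
        )
    )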

.. _using_simulator: using_simulator.html

TTv4 (AGAD): Dynamic Chopped Transfer
---------------------------------------
TTv4, originally named **Analog Gradient Accumulation with Dynamic reference (AGAD)** by
Rasch et al. :ref:`[17] <references>`, extends TTv3 by introducing a dynamic *symmetric point
tracking* mechanism for establishing reference values on-the-fly, using a modest amount of
additional digital compute, rather than relying on a separate reference conductance array or
differential read circuitry.

Concretely, TTv4 establishes dynamic symmetric points by comparing the running mean of reads
taken during the two most recent chopper half-periods. The transfer onto C is proportional to
the *difference* between these two half-period means:

.. math::

\Delta C \propto \bar{A}_{\text{new}} - \bar{A}_{\text{old}}

No update is dispatched to C when this difference is not statistically distinguishable from
noise, as judged by the running standard-deviation estimate (i.e., a standard-error of the
mean noise gate is applied). Because the reference values are derived from the device reads
themselves rather than from a separately measured baseline, AGAD greatly simplifies hardware
design — it does not need a separate conductance array for reference values or differential
read circuitry.
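
A sketch of the half-period comparison and the noise gate in plain Python (function and
parameter names, and the gating factor, are assumptions for illustration):

.. code-block:: python

    import numpy as np

    def agad_update(reads_new: np.ndarray, reads_old: np.ndarray,
                    gate_factor: float = 2.0) -> float:
        """Update proportional to the half-period mean difference, gated by
        the standard error of the mean (illustrative sketch)."""
        delta = reads_new.mean() - reads_old.mean()
        n_reads = reads_new.size + reads_old.size
        sem = np.concatenate([reads_new, reads_old]).std() / np.sqrt(n_reads)
        if abs(delta) < gate_factor * sem:
            return 0.0   # not distinguishable from noise: no pulses sent to C
        return delta     # dispatch an update proportional to the difference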

The algorithm is configured via
:class:`~aihwkit.simulator.configs.compounds.DynamicTransferCompound`.
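
A minimal configuration sketch (assuming
:class:`~aihwkit.simulator.configs.compounds.DynamicTransferCompound` accepts the same
compound-device and chopper options as TTv3; all values are illustrative):

.. code-block:: python

    from aihwkit.simulator.configs import UnitCellRPUConfig
    from aihwkit.simulator.configs.compounds import DynamicTransferCompound
    from aihwkit.simulator.configs.devices import SoftBoundsDevice

    rpu_config = UnitCellRPUConfig(
        device=DynamicTransferCompound(
            unit_cell_devices=[SoftBoundsDevice(), SoftBoundsDevice()],
            in_chop_prob=0.1,  # the chopper period defines the two half-periods
        )
    )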

Paper References
----------------
* [14] 2023 Nature,
`An analog-AI chip for energy-efficient speech recognition and transcription`_

* [15] 2025 NeurIPS,
`Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions`_

* [16] 2021 Frontiers in Artificial Intelligence,
`Enabling Training of Neural Networks on Noisy Hardware`_

* [17] 2024 Nature Communications,
`Fast and robust analog in-memory deep neural network training`_

* [18] 2026 AISTATS,
`In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning`_

.. _`Memory devices and applications for in-memory computing`: https://www.nature.com/articles/s41565-020-0655-z
.. _`Accurate deep neural network inference using computational phase-change memory`: https://www.nature.com/articles/s41467-020-16108-9
.. _`Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators`: https://www.nature.com/articles/s41467-023-40770-4
.. _`A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference`: https://www.nature.com/articles/s41928-023-01010-1
.. _`An analog-AI chip for energy-efficient speech recognition and transcription`: https://www.nature.com/articles/s41586-023-06337-5
.. _`Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions`: https://openreview.net/forum?id=WhEPg4mUs6
.. _`Enabling Training of Neural Networks on Noisy Hardware`: https://www.frontiersin.org/articles/10.3389/frai.2021.699148/full
.. _`Fast and robust analog in-memory deep neural network training`: https://www.nature.com/articles/s41467-024-51221-z
.. _`In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning`: https://arxiv.org/abs/2510.02516