# Energy-Area Aware Channel Design for Multi-Chip Interfaces

Muhammad Waqas Chaudhary, Andy Heinig Fraunhofer Institute for Integrated Circuits IIS Division Engineering of Adaptive Systems EAS Zeunerstr. 38, 01069 Dresden, Germany {muhammad.chaudhary, andy.heinig}@eas.iis.fraunhofer.de

Abstract—Multi-chip communication interfaces on an interposer or a package substrate must occupy minimum routing area while keeping the power consumption of the transceiver blocks low. This paper presents an algorithm to design the channel with respect to energy and area metrics for a given transceiver topology. The algorithm is then showcased using the example of a silicon interposer.

*Index Terms*—2.5D/3D interconnects and packages, electronic packages and microsystems, high-speed channels

## I. INTRODUCTION

Moore's law is reaching a communication bottleneck in 2D systems, which has led to the development of multi-chip systems to further enhance system performance [1]. Such a memory-processor system on an interposer is shown in Figure 1. These chips must exchange high-speed data, which requires high-speed chip-to-chip interfaces [2]. The transceivers in these interfaces are, however, designed for a specific channel represented by its scattering (S) parameters. They are then optimized at circuit level to achieve minimum power consumption for the given interconnect at the required data rate [3]. For optimal space utilisation in multi-chip systems, however, the routing area is an important constraint that should be co-optimized with the transmitter, or at least optimized for a given transceiver architecture.

A co-design of routing area and a current mode logic driver was previously presented in [4]. However, it does not consider the equalization needs of the transceiver or the required power consumption. Lho et al. describe an optimization approach for high-speed channels, but do not consider the relationship with the technology node, the equalization requirements, or the combined energy-area performance [5]. This paper addresses these gaps through an algorithm for combined optimization of transceiver and channel for minimum energy-area cost.

## II. DESIGN FLOW AND ALGORITHM

An overview of the design flow for the communication channel is shown in Figure 2. It consists of an extensive interconnect characterization, which is then used to derive the transceiver design constraints, especially with regard to drive strength, impedance matching and equalization. The energy consumption of the transceiver with various interconnects is used to develop a combined performance metric of routing area and energy consumption. One can then derive the minimum energy-area cost, measured by the performance metric of $pJ/bit \cdot \mu m$ (the product of energy efficiency in pJ/bit and signalling pitch in $\mu m$), for a given data rate, type of transceiver, substrate material and interconnect length.

Bhaskar Choubey, Chair of Analogue Circuits and Image Sensors, Siegen University, Hölderlinstr. 3, 57076 Siegen, Germany, bhaskar.choubey@uni-siegen.de

Fig. 1. Multi-chip interposer system model

This design flow is guided by the fact that while increasing the width of the interconnect lowers its insertion loss, it also increases the signal routing pitch $\rho$, measured in $\mu m$. The first step in the flow is hence to characterize interconnects with various widths (W) and spacings (S) for a given length and substrate material. The interconnect S-parameters are then evaluated for a given data rate per wire (GSG) in single-ended systems and per two wires (GSSG) in differential signalling transceiver architectures.

A decrease in interconnect width leads to higher transceiver energy consumption, while an increase leads to a larger signalling pitch. The flow of Figure 2 hence requires detailed analysis in each step and therefore an optimisation algorithm. Applying such an algorithm to this flow leads to an optimal channel design for a given substrate, bandwidth and transceiver topology. This algorithm is derived in this paper. It is designed to be holistic and keeps fixed constraints to a minimum. In addition to the design space discussed earlier, it also considers different signalling topologies and their correlation with channel area consumption and total interface power consumption. It should therefore provide an overall system-level optimization.

Fig. 2. Channel design flow

To derive the algorithm, let T be the set of possible transceiver topologies. This would include source series terminated signalling (SST), low voltage swing terminated logic (LVSTL) and high swing push-pull signalling (CMOS), i.e. $T = \{SST, LVSTL, CMOS\}$. The power consumption $\phi$ for a given transceiver topology $T_i \in T$ is a function of the signalling pitch $\rho$, which is defined by the interconnect width, spacing and ground width. It is the sum of the transmitter and receiver power consumption, $\phi_{T_i} = \phi_{Tx} + \phi_{Rx}$, where

$$\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$$
  
$$\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$$

Here $\phi_{Drv}$, $\phi_{Eq}$, $\phi_{Ser}$ and $\phi_{DeSer}$ represent the power of the driver, equalization, serialization and de-serialization blocks, respectively, $\phi_{buf}$ that of the receiver input amplifier, and $\phi_{Ckbuf}$ that of the clock buffering and distribution block. The back-end blocks in the transmitter and receiver, such as the serializer, de-serializer, clock buffers and samplers, are only indirectly influenced by variations in interconnect width and spacing; their requirements are instead set by the transmitter and receiver front-ends, i.e. the driver, receiver amplifier and equalization. The energy-area metric $\psi$ is hence given as $\psi = (\phi / f_b) \cdot \rho$ in $pJ/bit \cdot \mu m$, where $f_b$ is the data bit rate in Gb/s and $\phi$ is expressed in mW.

The width W ranges between a minimum $W_{min}$ and a maximum $W_{max}$ defined by the given interconnect technology. The spacing between interconnects is bounded below by the minimum value $S_{min}$ and generally does not exceed a few times the width of the signal line, e.g. $3 \times W$. For single-ended GSG signalling using a minimum-width ground interconnect, the signalling pitch is given as $\rho = W + W_{min} + 2S$. The final energy-area performance metric $\psi$ is then given as

$$\psi\left(T_{i},\rho\right) = \frac{\phi}{f_{b}}\left(W + W_{min} + 2S\right)$$
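The pitch and metric definitions above can be evaluated with a few lines of Python. This is a minimal sketch: the function names and the numeric inputs are illustrative assumptions and are not taken from the paper's results. It relies only on the fact that power in mW divided by data rate in Gb/s is numerically equal to energy in pJ/bit.

```python
def signalling_pitch_um(w_um: float, s_um: float, w_min_um: float) -> float:
    """Single-ended GSG pitch: rho = W + W_min + 2S (all in micrometres)."""
    return w_um + w_min_um + 2.0 * s_um


def energy_area_metric(phi_mw: float, fb_gbps: float,
                       w_um: float, s_um: float, w_min_um: float) -> float:
    """Energy-area cost psi = (phi / f_b) * rho in pJ/bit*um.

    mW divided by Gb/s is numerically pJ/bit, so no unit scaling is needed.
    """
    if fb_gbps <= 0:
        raise ValueError("data rate must be positive")
    energy_pj_per_bit = phi_mw / fb_gbps
    return energy_pj_per_bit * signalling_pitch_um(w_um, s_um, w_min_um)


# Illustrative (hypothetical) numbers only: 5 mW interface at 10 Gb/s,
# 2 um signal width, 1 um spacing, 1 um minimum-width ground wire.
psi = energy_area_metric(phi_mw=5.0, fb_gbps=10.0,
                         w_um=2.0, s_um=1.0, w_min_um=1.0)
print(f"psi = {psi:.2f} pJ/bit*um")
```

Keeping the pitch in a separate helper makes it straightforward to substitute a different pitch expression, for instance for differential GSSG routing, without touching the metric itself.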

The algorithm iterates exhaustively through all possible combinations of width, spacing and transceiver topologies to find the minimum energy-area cost combination of (W, S, T).

**Algorithm 1:** Channel design

**Result:** Optimum solution $T_{opt}, W_{opt}, S_{opt}$

- define width range: $W = \{W_{min}, \ldots, W_{max}\}$;
- define spacing range: $S = \{S_{min}, \ldots, S_{max}\}$;
- define transceiver types: $T_i \in T$;
- define data bit rate: $f_b$;
- define interconnect average length: $L$;
- initialize $\psi_{old}$ to a large value;
- **while** $T_i \in T$ **do**
  - **for** $W \leq W_{max}$ **do**
    - **for** $S \leq S_{max}$ **do**
      - find S-parameters for given $W$, $S$;
      - find pulse response for given $f_b$;
      - find required number of taps for Tx;
      - find required number of DFE taps for Rx;
      - calculate power consumption in Tx, Rx as $\phi_{Tx} = [\phi_{Drv} + \phi_{Eq} + \phi_{Ser} + \phi_{Ckbuf}]$ and $\phi_{Rx} = [\phi_{buf} + \phi_{Eq} + \phi_{DeSer} + \phi_{Ckbuf}]$;
      - calculate signalling pitch as $\rho = W + W_{min} + 2S$;
      - calculate interface energy-area cost as $\psi = \frac{\phi}{f_b} \left( W + W_{min} + 2S \right)$;
      - **if** $\psi < \psi_{old}$ **then**
        - $T_{opt} = T_i$, $W_{opt} = W$, $S_{opt} = S$;
        - update $\psi_{old} = \psi$;
      - **end**
    - **end**
  - **end**
- **end**
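The exhaustive search of Algorithm 1 can be sketched in Python as follows. This is not the authors' implementation: the S-parameter extraction, pulse response and tap estimation steps are collapsed into a single hypothetical callback `power_model_mw`, and `toy_power_mw` is a made-up placeholder whose numbers have no physical basis. The sketch only illustrates the loop structure and the tracking of the best $\psi$ found so far.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

# Hypothetical callback: total interface power phi = phi_Tx + phi_Rx in mW
# for a given (topology, width in um, spacing in um).
PowerModel = Callable[[str, float, float], float]


def design_channel(topologies: Iterable[str],
                   widths_um: Iterable[float],
                   spacings_um: Iterable[float],
                   w_min_um: float,
                   fb_gbps: float,
                   power_model_mw: PowerModel) -> Tuple[Tuple[str, float, float], float]:
    """Return ((T_opt, W_opt, S_opt), psi_min) by exhaustive search."""
    best = None
    psi_best = float("inf")  # plays the role of psi_old in Algorithm 1
    for t, w, s in product(topologies, widths_um, spacings_um):
        phi_mw = power_model_mw(t, w, s)
        rho_um = w + w_min_um + 2.0 * s   # single-ended GSG pitch
        psi = phi_mw / fb_gbps * rho_um   # pJ/bit*um
        if psi < psi_best:                # update only on improvement
            psi_best = psi
            best = (t, w, s)
    if best is None:
        raise ValueError("empty design space")
    return best, psi_best


def toy_power_mw(topology: str, w_um: float, s_um: float) -> float:
    """Placeholder model (assumption): power drops as wires get wider or
    further apart, standing in for reduced drive and equalization needs."""
    base = {"SST": 4.0, "LVSTL": 3.0, "CMOS": 6.0}[topology]
    return base + 2.0 / w_um + 0.5 / s_um


best, psi_best = design_channel(
    topologies=["SST", "LVSTL", "CMOS"],
    widths_um=[1.0, 1.5, 2.0],
    spacings_um=[1.0, 2.0],
    w_min_um=1.0,
    fb_gbps=10.0,
    power_model_mw=toy_power_mw,
)
print(best, round(psi_best, 3))
```

With this toy model the minimum lies at an intermediate width, which reflects the trade-off the flow targets: narrower wires cost power, wider wires cost pitch.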

## III. CASE STUDY: SILICON SUBSTRATE CHANNEL

To illustrate the algorithm, a silicon interposer chip-to-chip interface is presented as a case study. The stackup for this system is shown in Figure 3, where two metal layers in silicon dioxide are placed on a silicon substrate. The loss tangent ($\tan\delta$) depends on the substrate resistivity; for a typical resistivity of $100 \Omega \cdot cm$ it is chosen to be 0.1 for frequencies around 5-10 GHz [6]. The length of the interconnect is selected as 10 mm. The impact of varying the width from 1 to 2 µm on the channel insertion loss S21 is shown in Figure 4. The data rate for this study is chosen as 10 Gb/s, which has a Nyquist frequency of 5 GHz. At this frequency the 2 µm wide line has a frequency-dependent loss of only -2 dB, while the 1 µm wide line has an insertion loss of -7 dB. It should be noted that the 1 µm wide line has 6 dB higher DC loss, which leads to a reduced voltage swing at the Rx input.

To study the equalization and voltage swing requirements, the channel is excited at the transmitter side with a $10 \,\mathrm{Gb/s}$ pulse with a near-ideal rise time (1 ps) and a unit interval (UI) of 0.1 ns. The received pulse response after the channel is shown in Figure 5. As expected from the high interconnect resistance and the DC loss, the voltage swing is just 0.2 V for the 1 µm wide line.
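The link quantities quoted above follow from simple relations, sketched here in Python. The helper names are illustrative, and two-level (NRZ) signalling is assumed, so that the Nyquist frequency is half the bit rate.

```python
def nyquist_ghz(fb_gbps: float) -> float:
    """Nyquist frequency for NRZ signalling: half the bit rate."""
    return fb_gbps / 2.0


def unit_interval_ns(fb_gbps: float) -> float:
    """Unit interval (one bit period) in ns for a bit rate in Gb/s."""
    return 1.0 / fb_gbps


def db_to_amplitude_ratio(loss_db: float) -> float:
    """Voltage ratio corresponding to a loss given in dB (positive number)."""
    return 10.0 ** (-loss_db / 20.0)


fb = 10.0  # Gb/s, as in the case study
print(nyquist_ghz(fb))                        # Nyquist frequency in GHz
print(unit_interval_ns(fb))                   # UI in ns
print(round(db_to_amplitude_ratio(6.0), 3))   # amplitude factor for 6 dB loss
```

A 6 dB difference in DC loss corresponds to roughly half the voltage amplitude, which is consistent with the reduced swing reported for the narrower line.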

Fig. 3. Stackup for silicon interposer based multi-chip system

From the pulse response in Figure 5, it can be observed that there is no pre-cursor inter-symbol interference (ISI) for either line. The signal rises completely within 1 UI, as depicted by the dotted blue line at the 1 UI tick of the x-axis. However, both interconnects show some post-cursor ISI, as shown by the red dashed lines. The behavior is similar to an RC exponential voltage decay and is especially significant in the 1 $\mu$m wide line. In order to completely cancel the post-cursor ISI, strong continuous-time linear equalization (CTLE) or a number of decision feedback equalization (DFE) taps will be required, which impacts the power consumption of the transceiver. For the 1 $\mu$m wide line, at least two DFE taps are required to cancel the ISI in the 2nd and 3rd UI. For the 2 $\mu$m line, however, a single DFE tap cancelling the ISI in the 2nd UI is enough.
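One simple way to estimate the number of DFE taps from a pulse response is to sample it at UI spacing and count how far the significant post-cursor samples extend. The sketch below assumes such a rule with a hypothetical significance threshold of 5% of the main cursor. Apart from the 0.2 V swing of the narrow line, the sample values are invented for illustration and are not read from Figure 5.

```python
from typing import Sequence


def required_dfe_taps(cursors_v: Sequence[float],
                      threshold_ratio: float = 0.05) -> int:
    """Estimate DFE tap count from UI-spaced pulse response samples.

    cursors_v[0] is the main cursor; cursors_v[k] is the k-th post-cursor.
    A tap is needed for every post-cursor up to the last one whose
    magnitude exceeds threshold_ratio times the main cursor.
    """
    main = cursors_v[0]
    if main <= 0:
        raise ValueError("main cursor must be positive")
    taps = 0
    for k, c in enumerate(cursors_v[1:], start=1):
        if abs(c) / main > threshold_ratio:
            taps = k
    return taps


# Hypothetical UI-spaced samples in volts (illustrative only).
narrow_line = [0.20, 0.06, 0.03, 0.005]  # slow RC-like tail
wide_line = [0.40, 0.05, 0.01, 0.002]    # faster settling

print(required_dfe_taps(narrow_line))
print(required_dfe_taps(wide_line))
```

The threshold is a design choice: a tighter threshold demands more taps and hence more receiver power, which feeds directly into $\phi_{Eq}$.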

The power consumption estimate for different data rates and equalization requirements is based upon the work by Palaniappan and Palermo [3]. CTLE-based post-cursor ISI cancellation is generally used up to 12 dB insertion loss. This is because the CTLE is part of the Rx input amplifier and hence amplifies the signal and the noise equally. Therefore, for losses significantly higher than 12 dB, DFE taps are used, which are calculated directly from the pulse response shown in Figure 5.

Fig. 4. S-parameters extracted using HSPICE 2D field solver

Fig. 5. Response for $10 \,\mathrm{Gb/s}$ input pulse with $1 \,\mathrm{ps}$ rise time

Using CTLE for equalization and a power of $0.1 \,\mathrm{mW/Gb/s}$ for every 6 dB of peaking [3] at 10 Gb/s in the 90 nm technology node, an extra $\phi_{Eq}$ of 1 mW is added to the total power consumption $\phi_{Rx}$ of the 1 µm wide wire interface compared to the 2 µm wire interface. The equalization constraints of the CTLE and DFE will directly impact other design parameters for the driver, samplers and clock buffers in Tx and Rx. However, if we ignore them for a quick comparison of the energy-area metric and consider only the CTLE requirements, $\psi$ would be 0.7 and $0.8 \text{ pJ/bit} \cdot \mu \text{m}$ for the 1 and $2 \mu \text{m}$ wide wire interfaces, respectively.
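The extra equalization power quoted above can be checked with a short worked calculation, assuming that a single 6 dB peaking step separates the two interfaces:

$$\Delta\phi_{Eq} = 0.1\,\frac{\mathrm{mW}}{\mathrm{Gb/s}} \times 10\,\mathrm{Gb/s} = 1\,\mathrm{mW}, \qquad \frac{\Delta\phi_{Eq}}{f_b} = \frac{1\,\mathrm{mW}}{10\,\mathrm{Gb/s}} = 0.1\,\mathrm{pJ/bit}$$

The narrower line therefore pays about 0.1 pJ/bit more for equalization, which its smaller signalling pitch $\rho$ can offset in the product $\psi$.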

This shows that, at specific data rates and equalization requirements, higher power consumption is not that detrimental when the overall energy-area cost metric is used. Moreover, since the CTLE power consumption is constant for 6 and 12 dB peaking in 45 nm technology [3], the wider 2 µm interconnect interface would be an even worse choice in terms of $\psi$ there than in the 90 nm node. This leads to the conclusion that wider lines for high-speed chip-to-chip links are useful from an energy-area perspective in older technology nodes, whereas for newer nodes in the range of 45, 20 and 14 nm, thin wires with high receiver-side equalization requirements are a better choice.

## IV. CONCLUSION

A design flow for energy-area aware channel design for high-speed chip-to-chip links is presented using a silicon interposer interface as a case study. The flow shows that energy-area trade-offs can lead to an optimized interconnect width and spacing for a given data rate, transceiver type and technology node.

## REFERENCES

- [1] P. Vivet, E. Guthmuller *et al.*, "2.3 a 220gops 96-core processor with 6 chiplets 3d-stacked on an active interposer offering 0.6ns/mm latency, 3tb/s/mm2 inter-chiplet interconnects and 156mw/mm2 @ 82%-peak-efficiency dc-dc converters," in *2020 IEEE International Solid-State Circuits Conference (ISSCC)*. IEEE, 2020, pp. 46–48.
- [2] T. O. Dickson, Y. Liu *et al.*, "A 1.4 pj/bit, power-scalable 16×12 gb/s source-synchronous i/o with dfe receiver in 32 nm soi cmos technology," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 8, pp. 1917–1931, 2015.
- [3] A. Palaniappan and S. Palermo, "A design methodology for power efficiency optimization of high-speed equalized-electrical i/o architectures," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1421–1431, 2013.
- [4] M. W. Chaudhary and A. Heinig, "Co-design of cml io and interposer channel for low area and power signaling," in *Formal proceedings of the 2016 IEEE 19th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)*, J. Brenkuš and Stopjaková, Eds. IEEE, 2016, pp. 1–6.
- [5] D. Lho, J. Park *et al.*, "Bayesian optimization of high-speed channel for signal integrity analysis," in *EPEPS 2019*. IEEE, 2019, pp. 1–3.
- [6] R.-Y. Yang, C.-Y. Hung *et al.*, "Loss characteristics of silicon substrate with different resistivities," *Microwave and Optical Technology Letters*, vol. 48, no. 9, pp. 1773–1776, 2006.