Submitted by flotaganis on Sat, 2022-06-11 13:15

Forums:

computer architecture a quantitative approach 5th edition solutions manual


File Name:computer architecture a quantitative approach 5th edition solutions manual.pdf
Size: 2105 KB
Type: PDF, ePub, eBook

Category: Book
Uploaded: 26 May 2019, 20:40 PM
Rating: 4.6/5 from 844 votes.

Status: AVAILABLE

Last checked: 5 Minutes ago!

In order to read or download computer architecture a quantitative approach 5th edition solutions manual ebook, you need to create a FREE account.


eBook includes PDF, ePub and Kindle version

✔ Register a free 1 month Trial Account.

✔ Download as many books as you like (Personal use)

✔ Cancel the membership at any time if not satisfied.

✔ Join Over 80000 Happy Readers

Chapter 2 Solutions

a. 16B, to match the level 2 data cache write path.

b. Assume merging write buffer entries are 16B wide. Since each store can write 8B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A non-merging write buffer would take 4 cycles to write the 8B result of each store. This means the merging write buffer would be 2 times faster.

c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn’t change the required number of write buffer entries. If the memory

Copyright © 2012 Elsevier, Inc. All rights reserved.

From Figure 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.

a. The system built from 1Gb DRAMs will have twice as many banks as the system built from 2Gb DRAMs. Thus the 1Gb-based system should provide higher performance since it can have more banks simultaneously open.

b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page size activated on each x4 and x8 part are the same, and take roughly the same activation energy. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.

Similar behavior with different flattening points on L2 and L3 caches are observed.

b. The IPC decreases by 60, 20, and 66 when input data size goes from 8KB to 128KB, from 128KB to 4MB, and from 4MB to 32MB, respectively. This shows the importance of all caches. Among all three levels, L1 and L3 caches are more important.
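The merging write buffer comparison in part (b) of the Chapter 2 solutions above can be checked with a few lines of arithmetic. This is only a sketch: the function name and the choice of 8 stores are illustrative assumptions, while the 8B store width, 16B entry width, and 4-cycle L2 write time are the figures quoted in the text.

```python
# Hypothetical helper (not from the solutions manual): counts the cycles the
# L2 cache spends writing out the results of a run of stores.
def cycles_to_drain(num_stores, store_bytes, entry_bytes, l2_write_cycles, merging):
    if merging:
        # A merging buffer packs several stores into one entry before writing it.
        stores_per_entry = entry_bytes // store_bytes
        entries = -(-num_stores // stores_per_entry)  # ceiling division
    else:
        # A non-merging buffer writes each store as its own entry.
        entries = num_stores
    return entries * l2_write_cycles

# Figures from the text: 8B stores, 16B entries, 4 cycles per L2 write.
# The count of 8 stores is an arbitrary choice for illustration.
non_merging = cycles_to_drain(8, 8, 16, 4, merging=False)
merging = cycles_to_drain(8, 8, 16, 4, merging=True)
speedup = non_merging / merging
print(non_merging, merging, speedup)
```

With these inputs the non-merging buffer needs twice as many L2 writes as the merging one, which matches the "2 times faster" conclusion quoted above.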

computer architecture a quantitative approach 5th edition solutions manual pdf, computer architecture a quantitative approach fifth edition solution manual, solution manual computer architecture a quantitative approach 5th edition.

This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with capacity being 256KB and latency being around 11 cycles.

c. For a recent Intel i7 processor (3.3GHz Intel® Xeon® Processor X5680), when the data set size is increased from 8KB to 128KB, the number of L1 Dcache misses per 1K instructions increases by around 300, and the number of L2 cache misses per 1K instructions remains negligible. With an 11 cycle miss penalty, this means that without prefetching or latency tolerance from out-of-order issue we would expect there to be an extra 3300 cycles per 1K instructions due to L1 misses, which means an increase of 3.3 cycles per instruction on average.

Chapter 3 Solutions

Case Study 1: Exploring the Impact of Microarchitectural Techniques

The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48, if no new instruction’s execution could be initiated until the previous instruction’s execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.

Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency.
Now how many cycles does the loop require? The answer is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there’s no dependency between them. (Note that they both need the same input, F2, and they must both wait on F2’s readiness, but there is no constraint between them.) The LD following the MULTD does not depend on the DIVD nor the MULTD, so had this been a superscalar-order-3 machine,

The loop overhead instructions at the loop’s bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.

Possible answers:

3.5 Long-latency ops are at highest risk of being passed by a subsequent op. Then update all the src (source) registers accordingly, so that true data dependencies are maintained.

3.8 See Figure S.8. The rename table has arbitrary values at clock cycle N - 1. Look at the next two instructions (I0 and I1): I0 targets the F1 register, and I1 will write the F4 register. This means that in clock cycle N, the rename table will have had its entries 1 and 4 overwritten with the next available Temp register designators. I0 gets renamed first, so it gets the first T reg (9). In clock cycle N, instructions I2 and I3 come along; I2 will overwrite F6, and I3 will write F0. This means the rename table’s entry 6 gets 11 (the next available T reg), and rename table entry 0 is written to the T reg after that (12).

What could go wrong with this?
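The rename-table walk-through in 3.8 can be sketched in a few lines. This is an illustrative model, not the manual's figure: the function name, the dictionary representation, and the identity starting table are assumptions (the text only says the starting values are arbitrary). That I1 receives T reg 10 is inferred from I0 getting 9 and entry 6 later getting 11.

```python
# Hypothetical sketch of destination-register renaming, assuming temp
# registers are handed out in program order from a simple counter.
def rename_destinations(rename_table, dest_regs, next_temp):
    table = dict(rename_table)  # copy so the caller's table is not mutated
    for dest in dest_regs:
        table[dest] = next_temp  # architectural reg now maps to a fresh T reg
        next_temp += 1
    return table, next_temp

# Assumed starting state: each F register maps to itself.
table = {reg: reg for reg in range(8)}

# Clock cycle N - 1: I0 targets F1, I1 writes F4; first free T reg is 9.
table, next_temp = rename_destinations(table, [1, 4], 9)

# Clock cycle N: I2 overwrites F6, I3 writes F0.
table, next_temp = rename_destinations(table, [6, 0], next_temp)

print(table[1], table[4], table[6], table[0])
```

Under these assumptions entries 1, 4, 6, and 0 end up mapped to T regs 9, 10, 11, and 12, in line with the values stated in the text.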
If an interrupt is taken between clock cycles 1 and 4, then the results of the LW at cycle 2 will end up in R1, instead of the LW at cycle 1. Bank stalls and ECC stalls will cause the same effect: pipes will drain, and the last writer wins, a classic WAW hazard. All other “intermediate” results are lost.

3.11 See Figure S.11. The convention is that an instruction does not enter the execution phase until all of its operands are ready. So the first instruction, LW R3,0(R0), marches through its first three stages (F, D, E) but that M stage that comes next requires the usual cycle plus two more for latency.

3.12

a. 4 cycles lost to branch overhead. Without bypassing, the results of the SUB instruction are not available until the SUB’s W stage. A dynamic branch predictor remembers that when the branch instruction was fetched in the past, it eventually turned out to be a branch, and this branch was taken. So a “predicted taken” will occur in the same cycle as the branch is fetched, and the next fetch after that will be to the presumed target. It feeds the next ADDD, and ADDD feeds the SD below. With reg renaming, doesn't have to wait until the LD of (a different) F4 has completed.

    SUB R20,R4,Rx
    BNZ R20, Loop

Figure S.12 Instructions in code where register renaming improves performance.

b. Think of this exercise from the Reservation Station’s point of view: at any given clock cycle, it can only “see” the instructions that were previously written into it, that have not already dispatched.

1. Another ALU: 0 improvement
2. Cutting longest latency in half: divider is longest at 12 cycles.
If RS schedules 2nd loop's critical LD in cycle 2, then loop 2's critical dependency chain will be the same length as loop 1's is. Since we're not functional-unit-limited for this code, only one extra clock cycle is needed.

Exercises

3.13

3.18 For this problem we are given the base CPI without branch stalls. Storing the target instruction of an unconditional branch effectively removes one instruction. If there is a BTB hit in instruction fetch and the target instruction is available, then that instruction is fed into decode in place of the branch instruction. The penalty is -1 cycle. The hit percentage to just break even is simply 20.

5.5

5.6

P0: read 120, Read mi

5.7

d. Assume the processors acquire the lock in order. P0 will acquire it first, incurring 100 stall cycles to retrieve the block from memory. P1 and P3 will stall until P0’s critical section ends (ping-ponging the block back and forth) 1000 cycles later. P0 will stall for (about) 40 cycles while it fetches the block to invalidate it; then P1 takes 40 cycles to acquire it. P1’s critical section is 1000 cycles, plus 40 to handle the write miss at release. Finally, P3 grabs the block for a final 40 cycles of stall. So, P0 stalls for 100 cycles to acquire, 10 to give it to P1, 40 to release the lock, and a final 10 to hand it off to P1, for a total of 160 stall cycles. Finally, P3 gets the lock 40 cycles later, so it stalls a total of 2280 cycles.

b.
The optimized spin lock will have many fewer stall cycles than the regular spin lock because it spends most of the critical section sitting in a spin loop (which while useless, is not defined as a stall cycle). So approximately 945 cycles total.

c. Approximately 31 interconnect transactions. The first processor to win arbitration for the interconnect gets the block on its first try (1); the other two ping-pong the block back and forth during the critical section. Because the latency is 40 cycles, this will occur about 25 times (25). The first processor does a write to release the lock, causing another bus transaction (1), and the second processor does a transaction to perform its test and set (1). The last processor gets the block (1) and spins on it until the second processor releases it (1). Finally the last processor grabs the block (1).

d. Approximately 15 interconnect transactions. Assume processors acquire the lock in order. All three processors do a test, causing a read miss, then a test and set, causing the first processor to upgrade and the other two to write miss (6). The losers sit in the test loop, and one of them needs to get back a shared block first (1). When the first processor releases the lock, it takes a write miss (1) and then the two losers take read misses (2). Both have their test succeed, so the new winner does an upgrade and the new loser takes a write miss (2). The loser spins on an exclusive block until the winner releases the lock (1). The loser first tests the block (1) and then test-and-sets it, which requires an upgrade (1).

5.8 Latencies in implementation 1 of Figure 5.36 are used.

5.9

a. P0: write 110 <-- 80
   P0: read 108

b. P0: write 100 <-- 80
   P0: read 108

c. P0: write 110 <-- 80
   P0: write 100 <-- 90

5.10 a.
P0,0: write 100 <-- 80, Write hit only seen by P0,0

b. P0,0: write 108 <-- 88, Write “upgrade” received by P0,0; invalidate received by P3,1

c. P0,0: write 118 <-- 90, Write miss received by P0,0; invalidate received by P1,0

d. It also allows silent downgrades to I, allowing the processor to discard its copy without notifying memory. The memory must have a way of inferring either of these transitions. In a directory-based system, this is typically done by having the directory assume that the node is in state M and forwarding all misses to that node. If a node has silently downgraded to I, then it sends a NACK (Negative Acknowledgment) back to the directory, which then infers that the downgrade occurred. However, this results in a race with other messages, which can cause other problems.

P0,0: read 100, Read hit, 1 cycle

b. It is crucial that the protocol implementation guarantee (at least with a probabilistic argument) that a processor will be able to perform at least one memory operation each time it completes a cache miss. Otherwise, starvation might result. If a processor is not guaranteed to be able to perform at least one instruction, then each could steal the block from the other repeatedly. In the worst case, no processor could ever successfully perform the exchange.

a. P1,0: read 100
   P3,1: write 100 <-- 90

In this problem, both P0,1 and P3,1 miss and send requests that race to the directory. Assuming that P0,1’s GetS request arrives first, the directory will forward P0,1’s GetS to P0,0, followed shortly afterwards by P3,1’s GetM. If the network maintains point-to-point order, then P0,0 will see the requests in the right order and the protocol will work as expected. That latter number depends on both the topology and the application.

c.
Since the CPU frequency and the number of instructions executed did not change, the answer can be obtained by dividing the CPI for each of the topologies (worst case or average) by the base (no remote communication) CPI.

To keep the figures from becoming cluttered, the coherence protocol is split into two parts as was done in Figure 5.6 in the text. Figure S.34 presents the CPU portion of the coherence protocol, and Figure S.35 presents the bus portion of the protocol. In both of these figures, the arcs indicate transitions and the text along each arc indicates the stimulus (in normal text) and bus action (in bold text) that occurs during the transition between states. Finally, like the text, we assume a write hit is handled as a write miss.

Figure S.34 presents the behavior of state transitions caused by the CPU itself. In this case, a write to a block in either the invalid or shared state causes us to broadcast a “write invalidate” to flush the block from any other caches that hold the block and move to the exclusive state. We can leave the exclusive state through either an invalidate from another processor (which occurs on the bus side of the coherence protocol state diagram), or a read miss generated by the CPU (which occurs when an exclusive block of data is displaced from the cache by a second block). In the shared state only a write by the CPU or an invalidate from another processor can move us out of this state. In the case of transitions caused by events external to the CPU, the state diagram is fairly simple, as shown in Figure S.35. When another processor writes a block that is resident in our cache, we unconditionally invalidate the corresponding block in our cache. This ensures that the next time we read the data, we will load the updated value of the block from memory.
Also, whenever the bus sees a read miss, it must change the state of an exclusive block to shared as the block is no longer exclusive to a single cache.

The major change introduced in moving from a write-back to write-through cache is the elimination of the need to access dirty blocks in another processor’s caches. As a result, in the write-through protocol it is no longer necessary to provide the hardware to force write back on read accesses or to abort pending memory accesses. As memory is updated during any write on a write-through cache, a processor that generates a read miss will always retrieve the correct information from memory. Basically, it is not possible for valid cache blocks to be incoherent with respect to main memory in a system with write-through caches.

The following three transitions are those that change.

- from Dirty Exclusive to Shared, the label changes to CPU read miss on a Shared block
- from Invalid to Shared, the label changes to CPU miss on a Shared block
- from Shared to Shared, the miss transition label changes to CPU read miss on a Shared block

An obvious complication introduced by providing a valid bit per word is the need to match not only the tag of the block but also the offset within the block when snooping the bus. This is easy, involving just looking at a few more bits. In addition, however, the cache must be changed to support write-back of partial cache blocks. When writing back a block, only those words that are valid should be written to memory because the contents of invalid words are not necessarily coherent with the system. Finally, given that the state machine of Figure 5.7 is applied at each cache block, there must be a way to allow this diagram to apply when state can be different from word to word within a block.
The easiest way to do this would be to provide the state information of the figure for each word in the block. Doing so would require much more than one valid bit per word, though. Without replication of state information the only solution is to change the coherence protocol slightly.

a. The instruction execution component would be significantly sped up because the out-of-order execution and multiple instruction issue allows the latency of this component to be overlapped. The cache access component would be similarly sped up due to overlap with other instructions, but since cache accesses take longer than functional unit latencies, they would need more instructions to be issued in parallel to overlap their entire latency. So the speedup for this component would be lower.

The memory access time component would also be improved, but the speedup here would be lower than the previous two cases. Because the memory comprises local and remote memory accesses and possibly other cache-to-cache transfers, the latencies of these operations are likely to be very high (100’s of processor cycles). The 64-entry instruction window in this example is not likely to allow enough instructions to overlap with such long latencies. There is, however, one case when large latencies can be overlapped: when they are hidden under other long latency operations. This leads to a technique called miss-clustering that has been the subject of some compiler optimizations. The other-stall component would generally be improved because they mainly consist of resource stalls, branch mispredictions, and the like. The synchronization component if any will not be sped up much.

b. Memory stall time and instruction miss stall time dominate the execution for OLTP, more so than for the other benchmarks. Both of these components are not very well addressed by out-of-order execution.
Hence the OLTP workload has lower speedup compared to the other benchmarks with System B.

Because false sharing occurs when both the data object size is smaller than the granularity of cache block valid bit(s) coverage and more than one data object is stored in the same cache block frame in memory, there are two ways to prevent false sharing. Changing the cache block size or the amount of the cache block covered by a given valid bit are hardware changes and outside the scope of this exercise. However, the allocation of memory locations to data objects is a software issue.

The goal is to locate data objects so that only one truly shared object occurs per cache block frame in memory and that no non-shared objects are located in the same cache block frame as any shared object. If this is done, then even with just a single valid bit per cache block, false sharing is impossible. Note that shared, read-only-access objects could be combined in a single cache block and not contribute to the false sharing problem because such a cache block can be held by many caches and accessed as needed without any invalidations to cause unnecessary cache misses.

To the extent that shared data objects are explicitly identified in the program source code, then the compiler should, with knowledge of memory hierarchy details, be able to avoid placing more than one such object in a cache block frame in memory. If shared objects are not declared, then programmer directives may need to be added to the program. The remainder of the cache block frame should not contain data that would cause false sharing misses. The sure solution is to pad the block with non-referenced locations.

Padding a cache block frame containing a shared data object with unused memory locations may lead to rather inefficient use of memory space.
A cache block may contain a shared object plus objects that are read-only as a trade-off between memory use efficiency and incurring some false-sharing misses. This optimization almost certainly requires programmer analysis to determine if it would be worthwhile. Generally, careful attention to data distribution with respect to cache lines and partitioning the computation across processors is needed.

The problem illustrates the complexity of cache coherence protocols. In this case, this could mean that the processor P1 evicted that cache block from its cache and immediately requested the block in subsequent instructions. Given that the write-back message is longer than the request message, with networks that allow out-of-order requests, the new request can arrive before the write back arrives at the directory. One solution to this problem would be to have the directory wait for the write back and then respond to the request. Alternatively, the directory can send out a negative acknowledgment (NACK). Note that these solutions need to be thought out very carefully since they have potential to lead to deadlocks based on the particular implementation details of the system. Formal methods are often used to check for races and deadlocks.

If replacement hints are used, then the CPU replacing a block would send a hint to the home directory of the replaced block. Such hint would lead the home directory to remove the CPU from the sharing list for the block. That would save an invalidate message when the block is to be written by some other CPU. Note that while the replacement hint might reduce the total protocol latency incurred when writing a block, it does not reduce the protocol traffic (hints consume as much bandwidth as invalidates).

a.
Considering first the storage requirements for nodes that are caches under the directory subtree:

The directory at any level will have to allocate entries for all the cache blocks cached under that directory’s subtree. In the worst case (all the CPU’s under the subtree are not sharing any blocks), the directory will have to store as many entries as the number of blocks of all the caches covered in the subtree. That means that the root directory might have to allocate enough entries to reference all the blocks of all the caches. Every memory block cached in a directory will be represented by an entry; the k-bit vector will have a bit specifying all the subtrees that have a copy of the block. For example, for a binary tree an entry means that block m is cached under both branches of the tree. To be more precise, one bit per subtree would

At the next level of the hierarchy, the directories will be k times bigger. The number of directories at level i is k^i.

To consider memory blocks with a home in the subtree cached outside the subtree. The storage requirements per directory would have to be modified. Calculation outline:

Note that for some directory (for example the ones at level l-1) the number of possible home nodes that can be cached outside the subtree is equal to (b x (k^l - x)), where k^l is the total number of CPU’s, b is the number of blocks per cache and x is the number of CPU’s under the directory’s subtree. It should be noted that the extra storage diminishes for directories in higher levels of the tree (for example the directory at level 0 does not require any such storage since all the blocks have a home in that directory’s subtree).

b. Simulation.

Assume a two processor system with one processor performing multiple writes on the data and the other processor spinning on the synchronization variable.
With an invalidate protocol, false sharing will mean that every access to the cache line ends up being a miss resulting in significant performance penalties.

The monitor has to be placed at a point through which all memory accesses pass. One suitable place will be in the memory controller at some point where accesses from the 4 cores converge (since the accesses are uncached anyway). The monitor will use some sort of a cache where the tag of each valid entry is the address accessed by some load-linked instruction. If there is no matching entry in the cache, then a new entry is created (if there is space in the cache).

If the core numbers are the same, then the matching cache entry is invalidated, the write proceeds to memory and returns a success signal to the originating core. The problem states that L2 has equal or higher associativity than L1, both use LRU, and both have the same block size.

When a miss is serviced from memory, the block is placed into all the caches, i.e., it is placed in L1 and L2. Also, a hit in L1 is recorded in L2 in terms of updating LRU information. Another key property of LRU is the following. Let A and B both be sets whose elements are ordered by their latest use. If A is a subset of B such that they share their most recently used elements, then the LRU element of B must either be the LRU element of A or not be an element of A.

This simply states that the LRU ordering is the same regardless if there are 10 entries or 100. Let us assume that we have a block, D, that is in L1, but not in L2. Since D initially had to be resident in L2, it must have been evicted. At the time of eviction D must have been the least recently used block. Since an L2 eviction took place, the processor must have requested a block not resident in L1 and obviously not in L2.
The new block from memory was placed in L2 (causing the eviction) and placed in L1 causing yet another eviction. L1 would have picked the least recently used block to evict.

Since we know that D is in L1, it must be the LRU entry since it was the LRU entry in L2 by the argument made in the prior paragraph. This means that L1 would have had to pick D to evict. This results in D not being in L1 which results in a contradiction from what we assumed. If an element is in L1 it has to be in L2 (inclusion) given the problem’s assumptions about the cache.

Analytical models can be used to derive high-level insight on the behavior of the system in a very short time. Typically, the biggest challenge is in determining the values of the parameters. In addition, while the results from an analytical model can give a good approximation of the relative trends to expect, there may be significant errors in the absolute predictions.

Trace-driven simulations typically have better accuracy than analytical models, but need greater time to produce results. The advantages are that this approach can be fairly accurate when focusing on specific components of the system (e.g., cache system, memory system, etc.). However, this method does not model the impact of aggressive processors (mispredicted path) and may not model the actual order of accesses with reordering. Traces can also be very large, often taking gigabytes of storage, and determining sufficient trace length for trustworthy results is important. It is also hard to generate representative traces from one class of machines that will be valid for all the classes of simulated machines.
It is also harder to model synchronization on these systems without abstracting the synchronization in the traces to their high-level primitives.

Execution-driven simulation models all the system components in detail and is consequently the most accurate of the three approaches. However, its speed of simulation is much slower than that of the other models.

Chapter 6 Solutions

Case Study 1: Total Cost of Ownership Influencing Warehouse-Scale Computer Design Decisions

a.