#
ee92d6ff |
| 25-Apr-2025 |
Yanqin Li <[email protected]> |
fix(StoreQueue): add nc_req_ack state to avoid duplicated request (#4625)
## Bug Discovery
The Svpbmt CI of master at https://github.com/OpenXiangShan/XiangShan/actions/runs/14639358525/job/41077890352 reported the following implicit output error:
```
check_misa_h PASSED
test_pbmt_perf TEST: read 4 Bytes 1000 times
Svpbmt IO test... addr:0x10006d000 start: 8589, end: 59845, ticks: 51256
Svpbmt NC test... addr:0x10006c000 start: 67656, end: 106762, ticks: 39106
Svpbmt NC OUTSTANDING test... smblockctl = 0x3f7 addr:0x10006c000 start: 118198, end: 134513, ticks: 16315
Svpbmt PMA test... addr:0x100000000 start: 142696, end: 144084, ticks: 1388 PASSED
test_pbmt_ldld_violate ERROR: untested exception! cause NO: 5 (mhandler, 219)
[FORK_INFO pid(1251274)] clear processes...
Core 0: HIT GOOD TRAP at pc = 0x80005d64
Core-0 instrCnt = 174,141, cycleCnt = 240,713, IPC = 0.723438
```
## Design Background
For NC (Non-Cacheable) store operations, the handshake logic between the StoreQueue and Uncache is as follows:
1. **Without Outstanding Enabled:** In the `nc_idle` state, when an executable `nc store` is encountered, it transitions to the `nc_req` state. After `req.fire`, it moves to the `nc_resp` state. Once `resp.fire` is triggered, it returns to `nc_idle`, and both `rdataPtrExtNext` and `deqPtrExtNext` are updated to handle the next request.
2. **With Outstanding Enabled:** In the `nc_idle` state, upon encountering an executable `nc store`, it transitions to the `nc_req` state. After `req.fire`, it **returns to `nc_idle`** (Point A). Once the request is fully written into Uncache, i.e., upon receiving `ncSlaveAck` (Point B), it updates `rdataPtrExtNext` and `deqPtrExtNext` to handle the next request.
## Bug Description
In the above scenario, the transition to `nc_idle` at Point A occurs two cycles earlier than Point B, so at Point A `rdataPtr` still points to the entry of the previous uncache request (let's call it NC1). The condition for sending an uncache request is still met at this moment, so a **duplicate `uncache` request** for NC1 is issued at Point A.
By the time Point B occurs, **two identical requests for NC1** have already been sent. At Point B, `rdataPtr` is updated to proceed to the next request. However, when the **second `ncSlaveAck`** for NC1 returns, `rdataPtr` is updated **again**, causing it to move forward **twice** for a single request. This eventually results in one of the following requests never being executed.
## Bug Fix
Given that multiple cycles are required to ensure that a request is fully written to Uncache, a new state called `nc_req_ack` is introduced. The revised handshake logic with outstanding enabled is as follows:
In the `nc_idle` state, when an executable `nc store` is encountered, it transitions to the `nc_req` state. After `req.fire`, it moves to the `nc_req_ack` state. Once the request is fully written to Uncache and `ncSlaveAck` is received, it transitions back to `nc_idle`, and updates `rdataPtrExtNext` and `deqPtrExtNext` to handle the next request.
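The difference between the old and the revised handshake can be illustrated with a small cycle-level model. This is only a sketch in Python, not the Chisel implementation: `simulate`, `ACK_DELAY`, and the assumption that Uncache is always ready are hypothetical simplifications, and only the state names come from the description above.

```python
ACK_DELAY = 2  # assumed number of cycles from req.fire to ncSlaveAck


def simulate(num_stores, use_req_ack, max_cycles=100):
    """Return (list of queue indices requested, final rdata_ptr)."""
    state, rdata_ptr = "nc_idle", 0
    sent = []          # queue index of every request issued to Uncache
    pending_acks = []  # cycles at which an ncSlaveAck will arrive
    for cycle in range(max_cycles):
        # Register semantics: next values are computed from current values.
        next_state, next_ptr = state, rdata_ptr
        if state == "nc_idle" and rdata_ptr < num_stores:
            next_state = "nc_req"
        elif state == "nc_req":
            if rdata_ptr < num_stores:
                sent.append(rdata_ptr)  # req.fire (Uncache assumed ready)
                pending_acks.append(cycle + ACK_DELAY)
                next_state = "nc_req_ack" if use_req_ack else "nc_idle"
            else:
                next_state = "nc_idle"
        if cycle in pending_acks:  # ncSlaveAck received this cycle
            pending_acks.remove(cycle)
            next_ptr = rdata_ptr + 1
            if use_req_ack:
                next_state = "nc_idle"
        state, rdata_ptr = next_state, next_ptr
        if rdata_ptr >= num_stores and not pending_acks:
            break
    return sent, rdata_ptr
```

In this model, returning to `nc_idle` right after `req.fire` issues the first store twice and advances the pointer once more than there are stores, while waiting in `nc_req_ack` issues each store exactly once.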
|
#
ce78e60c |
| 21-Apr-2025 |
Anzo <[email protected]> |
fix(StoreQueue): remove `cboZeroUop` saved `sqptr` (#4591)
|
#
724e3eb4 |
| 10-Apr-2025 |
Yanqin Li <[email protected]> |
fix(StoreQueue): keep readPtr until slave ack when outstanding (#4531)
|
#
1592abd1 |
| 08-Apr-2025 |
Yan Xu <[email protected]> |
feat: support inst lifetime trace (#4007)
PerfCCT (performance counter commit trace) is an instruction-level performance counter, similar to the one in GEM5.
How to use this:
1. Make with the "WITH_CHISELDB=1" argument
2. Run with "--dump-db --dump-select-db lifetime" to get the database
3. To visualize instruction lifetimes, run "python3 scripts/perfcct.py "the-db-file-path" -p 1 -v | less"
4. The analysis script is currently in the XS-GEM5 repo, see https://github.com/OpenXiangShan/GEM5/blob/xs-dev/util/ClockAnalysis.py
How it works:
1. Allocate a unique tag "seqNum" (as GEM5 does) for each instruction at the fetch stage
2. Pass the "seqNum" through each pipeline stage
3. Record perf data through the DPIC interface
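The "How it works" steps can be pictured with a toy recorder. This is a hypothetical Python sketch, not the PerfCCT code: the class `LifetimeTrace`, its methods, and the stage name `"commit"` are invented for illustration, and the real design records data through DPIC rather than a Python dictionary.

```python
from itertools import count


class LifetimeTrace:
    """Toy model: tag each instruction at fetch, then timestamp later stages."""

    def __init__(self):
        self._next_seq = count()  # source of unique seqNum tags
        self.records = {}         # seq_num -> {stage name: cycle}

    def fetch(self, cycle):
        # Step 1: allocate a unique seqNum at the fetch stage.
        seq = next(self._next_seq)
        self.records[seq] = {"fetch": cycle}
        return seq

    def record(self, seq, stage, cycle):
        # Steps 2 and 3: the tag travels with the instruction and each
        # stage logs the cycle at which the instruction reached it.
        self.records[seq][stage] = cycle

    def lifetime(self, seq):
        stamps = self.records[seq]
        return max(stamps.values()) - stamps["fetch"]
```

Because every record is keyed by the tag, per-instruction latencies can be derived afterwards without any knowledge of the pipeline structure.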
|
#
4e7fa708 |
| 27-Feb-2025 |
zhanglinjuan <[email protected]> |
fix(StoreQueue): cbo.zero is written to sbuffer only if allocated (#4316)
For a misaligned store that crosses a 16-byte boundary, the store writes to sbuffer twice in one cycle but takes up only one SQ entry. If there is only one misaligned store in the SQ, `isCboZeroToSbVec`, which checks whether any cbo.zero is written to sbuffer based on `fuOpType` in `uop`, may pick up a wrong `fuOpType` from an empty SQ entry, or lead to X-state propagation in VCS simulation.
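To make the "two writes, one entry" situation concrete, the following Python sketch splits a store at a 16-byte boundary. The helper names `crosses_16b` and `split_at_16b` are hypothetical, and the sketch assumes stores of at most 16 bytes; it does not reflect how the RTL performs the split.

```python
def crosses_16b(addr, size):
    """True if the byte range [addr, addr + size) spans two 16-byte blocks."""
    return (addr // 16) != ((addr + size - 1) // 16)


def split_at_16b(addr, size):
    """Return the (addr, size) pieces written for one store."""
    if not crosses_16b(addr, size):
        return [(addr, size)]
    boundary = (addr // 16 + 1) * 16  # first byte of the next 16-byte block
    return [(addr, boundary - addr), (boundary, addr + size - boundary)]
```

For example, an 8-byte store at `0x1c` yields two pieces, `(0x1c, 4)` and `(0x20, 4)`, although it came from a single queue entry.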
|
#
1eb8dd22 |
| 24-Feb-2025 |
Kunlin You <[email protected]> |
submodule(utility), XSDebug: support collecting missing XSDebug (#4251)
Previously, in PR#3982, we added support for collecting XSLogs to LogPerfEndpoint.
However, with --enable-log, we should also collect some missing XSDebug.
This change moves these missing XSDebug outside WhenContext, and adds
WireInit to LogUtils' apply, to enable probing some subaccessed data,
like a vec elem with dynamic index.
|
#
a7904e27 |
| 24-Feb-2025 |
Anzo <[email protected]> |
fix(StoreQueue): fix threshold condition for fore write sbuffer (#4306)
Previously, the `ForceWrite` condition was hard-coded to (60, 55), which no longer applies after we adjusted `StoreQueueSize`.
---
Now a more reasonable parameterized setting is used. However, the conditions for optimal performance still need to be tested.
|
#
3c808de0 |
| 17-Feb-2025 |
Anzo <[email protected]> |
fix(LSU): fix cbo instr exceptions and implementation (#4262)
1. typo.
2. `cbo` instructions do not raise misaligned exceptions.
3. `cbo zero` needs to flush `sbuffer`.
4. `cbo zero` sets the mask correctly.
5. Add RAW checks to `cbo zero`.
6. Add trigger (Debug Mode) checks to `cbo zero`.
7. Fix several issues with the CBO instruction in NEMU.
----
In order not to create ambiguity with `io.mmioStout`, a new port of
`StoreQueue` is introduced to write back `cbo zero` after flushing the sbuffer.
Arbitration is performed in `MemBlock`, and currently, `cbo zero` has
higher priority by default.
`cbo zero` should not be written back at the same time as `mmio`.
---
A check on `CacheLine` has been added to `RAWQueue` to ensure memory
consistency when executing `cbo zero`.
See https://github.com/OpenXiangShan/XiangShan/issues/4240
for the specific issue.
---
The `cbo` instruction requires a trigger check.
---------
Co-authored-by: zhanglinjuan <[email protected]>
|
#
9e12e8ed |
| 08-Feb-2025 |
cz4e <[email protected]> |
style(Bundles): move bundles to Bundles.scala (#4247)
|
#
74050fc0 |
| 26-Jan-2025 |
Yanqin Li <[email protected]> |
perf(Uncache): add merge policy when entering (#4154)
# Background
## Problem
How to design a more efficient entry rule for a new load/store request when a load/store with the same address already exists in the `ubuffer`?
* **Old Design**: Always **reject** the new request.
* **New Design**: Consider **merging** requests.
## Merge Scenarios
‼️ A new request can be merged into an existing one only if both are `NC`.
1. **New Store Request:**
   1. **Existing Store:** Merge (the new store is younger).
   2. **Existing Load:** Reject.
2. **New Load Request:**
   1. **Existing Load:** Merge (the new load may be younger or older; both are OK to merge).
   2. **Existing Store:** Reject.
# What does this PR do?
## 1. Entry Actions
1. **Allocate** a new entry and mark as `valid`:
   1. When there is no matching address.
2. **Allocate** a new entry and mark as `valid` and `waitSame`:
   1. When there is a matching address, and:
      * The virtual addresses and attributes are the same.
      * The older entry is either selected to issue or issued.
3. **Merge** into an Existing Entry:
   1. When there is a matching address, and:
      * The virtual addresses and attributes are the same.
      * The older entry is **not** selected to issue or issued.
4. **Reject** the New Request:
   1. When the ubuffer is full.
   2. When there is a matching address, but:
      * The virtual addresses or attributes are **different**.
**NOTE:** According to the definition in the TL-UL SPEC, the `mask` must be continuous and naturally aligned, and the `addr` must correspond to the mask. Therefore, the "**same attributes**" here introduces a new condition: the merged `mask` must meet the requirements of being continuous and naturally aligned (function `continueAndAlign`). During merging, the block offset of addr must be synchronously updated in `UncacheEntry.update`.
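The mask condition in the note can be sketched as follows. This Python version is an illustration only: `continue_and_align` mirrors the name of the `continueAndAlign` function mentioned above but is not its implementation, `can_merge_masks` is a hypothetical helper, and treating an empty mask as invalid is an assumption.

```python
def continue_and_align(mask):
    """Check that the set bits of a byte mask form one contiguous run whose
    length is a power of two and whose start is aligned to that length."""
    if mask == 0:
        return False  # assumption: an empty mask is not a valid request
    low = (mask & -mask).bit_length() - 1  # index of the lowest set bit
    run = mask >> low
    if run & (run + 1):                    # a gap exists above the lowest bit
        return False
    length = run.bit_length()              # number of contiguous set bits
    return (length & (length - 1)) == 0 and low % length == 0


def can_merge_masks(old_mask, new_mask):
    """Two requests can only merge if the combined mask is still legal."""
    return continue_and_align(old_mask | new_mask)
```

For instance, masks `0b0011` and `0b1100` combine into the legal mask `0b1111`, whereas `0b0001` and `0b0100` combine into `0b0101`, which is not contiguous and therefore blocks the merge.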
## 2. Handshake Mechanism Between `LoadQueueUncache (M)` and `Uncache (S)`
> `mid`: master id
>
> `sid`: slave id
**Old Design:**
- `M` sends a `req` with a **`mid`**.
- `S` receives the `req`, records the **`mid`**.
- `S` sends a `resp` with the **`mid`**.
- `M` receives the `resp` and matches it with the recorded **`mid`**.
**New Design:**
- `M` sends a `req` with a **`mid`**.
- `S` receives the `req` and responds with `{mid, sid}`.
- `M` matches it with the **`mid`** and updates its record with the received **`sid`**.
- `S` sends a `resp` with its **`sid`**.
- `M` receives the `resp` and matches it with the recorded **`sid`**.
**Benefit:** The new design allows `S` to merge requests when a new request enters.
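A minimal Python sketch of the new handshake is given here to show why matching on `sid` makes merging possible. The classes `Slave` and `Master` and their methods are hypothetical, and the sketch assumes that a merged request is simply told the `sid` of the entry it merged into, ignoring every other merge condition.

```python
class Slave:
    """Toy Uncache: one sid per entry, entries keyed by address."""

    def __init__(self):
        self._next_sid = 0
        self._sid_of_addr = {}

    def accept(self, mid, addr):
        # A request to an address that already has an entry is merged,
        # so it is answered with the existing entry's sid (assumption).
        if addr not in self._sid_of_addr:
            self._sid_of_addr[addr] = self._next_sid
            self._next_sid += 1
        return {"mid": mid, "sid": self._sid_of_addr[addr]}


class Master:
    """Toy LoadQueueUncache: remembers which sid serves each mid."""

    def __init__(self):
        self.sid_of = {}  # mid -> sid

    def on_ack(self, ack):
        self.sid_of[ack["mid"]] = ack["sid"]

    def on_resp(self, sid):
        # Every request recorded with this sid is completed by one resp.
        return [mid for mid, s in self.sid_of.items() if s == sid]
```

With `mid`-based matching a single response could only complete one master entry; with `sid`-based matching, one response completes every request that was merged into the same slave entry.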
## 3. Forwarding Mechanism
**Old Design:** Each address in the `ubuffer` is **unique**, so forwarding is straightforward based on a match.
**New Design:**
* A single address may have up to two entries matched in the `ubuffer`.
* If it has two matched entries, one entry must be marked `inflight` and the other `waitSame`. In this case, the forwarded data comes from the merged data of the two entries, with the `inflight` entry being the older one.
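The two-entry case can be sketched as a byte-wise merge. This Python snippet is an assumption-laden illustration: the function `forward`, the `(mask, data)` tuple layout, and the 8-byte width are hypothetical, and it assumes that the younger `waitSame` entry wins on bytes that both entries cover.

```python
def forward(inflight, wait_same, width=8):
    """Merge forwarded bytes from two matching entries.

    Each entry is (mask, data): mask is a byte-enable bitmask and data is a
    bytes object of length `width`. The inflight entry is the older one, so
    the waitSame entry takes priority on overlapping bytes.
    """
    old_mask, old_data = inflight
    new_mask, new_data = wait_same
    merged = bytearray(width)
    for i in range(width):
        if (new_mask >> i) & 1:
            merged[i] = new_data[i]
        elif (old_mask >> i) & 1:
            merged[i] = old_data[i]
    return old_mask | new_mask, bytes(merged)
```

When only one entry matches, the same function degenerates to plain forwarding by passing an empty mask for the missing entry.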
## 4. Bug Fixes
1. In the `loadUnit`, `!tlbMiss` cannot be directly used as `tlbHit`, because when `tlbValid` is false, `!tlbMiss` can still be true.
2. `Uncache` state machine transition: The state indicating "**able to send requests**" (previously `s_refill_req`, now `s_inflight`) should not be triggered by `reqFire` but rather by `acquireFire`.
<img width="747" alt="image" src="https://github.com/user-attachments/assets/75fbc761-1da8-43d9-a0e6-615cc58cefef" />
# Evaluation
- ✅ timing
- ✅ performance

| Type | 4B*1000 | Speedup1-IO | 1B*4096 | Speedup2-IO |
| -------------- | ------- | ----------- | ------- | ----------- |
| IO | 51026 | 1 | 208149 | 1.00 |
| NC | 42343 | 1.21 | 169248 | 1.23 |
| NC+OT | 20379 | 2.50 | 160101 | 1.30 |
| NC+OT+mergeOpt | 16308 | 3.13 | 126369 | 1.65 |
| cache | 1298 | 39.31 | 4410 | 47.20 |
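The speedup columns appear to be the IO tick count divided by each configuration's tick count, rounded to two decimals. The short Python check below reproduces them from the tick columns under that reading; the dictionary and function names are hypothetical.

```python
# Tick counts copied from the table: (4B*1000, 1B*4096).
TICKS = {
    "IO": (51026, 208149),
    "NC": (42343, 169248),
    "NC+OT": (20379, 160101),
    "NC+OT+mergeOpt": (16308, 126369),
    "cache": (1298, 4410),
}


def speedups(config):
    """Speedup of `config` over the IO baseline for both workloads."""
    base_a, base_b = TICKS["IO"]
    ticks_a, ticks_b = TICKS[config]
    return round(base_a / ticks_a, 2), round(base_b / ticks_b, 2)
```

Under this reading, the merge optimization alone accounts for the step from 2.50 to 3.13 on the 4-byte workload and from 1.30 to 1.65 on the 1-byte workload.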
|
#
1abade56 |
| 22-Jan-2025 |
Anzo <[email protected]> |
fix(LSU): fix cbo instruction exception handling logic (#4215)
|
#
14651e98 |
| 07-Jan-2025 |
Anzo <[email protected]> |
fix(StoreQueue): remove the incorrect redirect logic (#4139)
|
#
30bd4482 |
| 30-Dec-2024 |
Anzo <[email protected]> |
fix(LSQ): fix 'enqCancelNum' bit width (#4109)
|
#
c2acf9ea |
| 30-Dec-2024 |
Anzo <[email protected]> |
fix(StoreQueue): fix `vecLastFlow` set logic (#4105)
|
#
be8e95bc |
| 25-Dec-2024 |
Anzo <[email protected]> |
fix(MemBlock): fix overflow during lsqptr calculation (#4084)
The addition previously used to calculate the `lsq` pointer overflows, because `numLsElem` is only 5 bits wide and accumulating it across multiple uops exceeds that width.
---
Theoretically this would have been a problem in previous versions as well, but for some reason the bug didn't occur in previous versions until `newDispatch`.
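The overflow can be reproduced with fixed-width arithmetic. The Python sketch below is illustrative only: `add_fixed_width` and `accumulate` are hypothetical helpers, and the per-uop value of 16 elements is an invented example rather than a figure from the commit.

```python
def add_fixed_width(a, b, width):
    """Add two values and keep only the low `width` bits of the sum."""
    return (a + b) & ((1 << width) - 1)


def accumulate(values, width):
    """Sum per-uop element counts in a register of the given bit width."""
    total = 0
    for v in values:
        total = add_fixed_width(total, v, width)
    return total
```

Three uops of 16 elements each need 48 entries in total, but a 5-bit accumulator wraps around to 16, so the pointer derived from it lands in the wrong place; widening the accumulator removes the wraparound.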
|
#
519244c7 |
| 25-Dec-2024 |
Yanqin Li <[email protected]> |
submodule(CoupledL2, OpenLLC): support pbmt in CHI scene (#4071)
* L1: deliver the NC and PMA signals of uncacheReq to L2
* L2: [support Svpbmt on CHI MemAttr](https://github.com/OpenXiangShan/CoupledL2/pull/273)
* LLC: [Non-cache requests are forwarded directly downstream without entering the slice](https://github.com/OpenXiangShan/OpenLLC/pull/28)
|
#
8b33cd30 |
| 13-Dec-2024 |
klin02 <[email protected]> |
feat(XSLog): move all XSLog outside WhenContext for collection
Data inside a WhenContext is not accessible from another module, so to support XSLog collection we move all XSLog calls and their related signals outside WhenContext. For example, `when(cond1){XSDebug(cond2, pable)}` becomes `XSDebug(cond1 && cond2, pable)`.
|
#
5de026b7 |
| 17-Dec-2024 |
Anzooooo <[email protected]> |
fix(LSQ): modify the enq logic
This commit reworks the previous naive enqueue logic. This greatly reduces the generated Verilog:
* Verilog in StoreQueue: from 260K lines -> 50K lines
* Verilog in VirtualLoadQueue: from 130K lines -> 20K lines
Also, we no longer need to limit `numLsElem` per `io.enq`.
|
#
562eaa0c |
| 15-Dec-2024 |
Anzooooo <[email protected]> |
fix(MemBlock): fix misaligned exception and remove redundant reg from `SQ`
|
#
909ea138 |
| 16-Dec-2024 |
Anzo <[email protected]> |
fix(LSQ): modify misaligned `forward fault` detection (#4038)
Previously, I used an inappropriate way for another misalign to trigger a `forward fault`:
https://github.com/OpenXiangShan/XiangShan/blob/38d0d7c5a34a23dfdb58a3cb2737c3cfddb3ec9d/src/main/scala/xiangshan/mem/lsqueue/StoreQueue.scala#L684-L711
This would cause the `BlockSqIdx` passed to `LoadQueueReplay` to use the `sqIdx` from `uop` instead of the `sqIdx` with the unalign flag bit:
https://github.com/OpenXiangShan/XiangShan/blob/38d0d7c5a34a23dfdb58a3cb2737c3cfddb3ec9d/src/main/scala/xiangshan/mem/lsqueue/StoreQueue.scala#L776-L782
**This can leave `LoadQueueReplay` stuck.**
To resolve the stall, we incorrectly introduced this Commit(af757d1b973e03dae3ce0078a4a8249b593188ec).
This Commit(af757d1b973e03dae3ce0078a4a8249b593188ec) causes `BlockSqIdx` to unblock without `DataValid`, which leads to certain performance issues.
This revision fixes the inappropriate `forward fault` triggering method and reverts the Commit(af757d1b973e03dae3ce0078a4a8249b593188ec).
**This should bring performance back up again.**
### Apologies for my mistake.
|
#
4fb7cc17 |
| 16-Dec-2024 |
cz4e <[email protected]> |
timing(StoreQueue): cmoReq.address add 1 latch (#3988)
|
#
99baa882 |
| 13-Dec-2024 |
Anzo <[email protected]> |
fix(StoreQueue): fix the `vecExceptionFlag` setting condition (#4037)
An entry is considered to have `deq` only if `dataBuffer.io.enq.fire`.
|
#
433cc30b |
| 12-Dec-2024 |
Anzo <[email protected]> |
fix(StoreQueue): fix `difftestinfo` for store event (#4027)
Previously, collecting the difftest information was not handled for the
case where a misaligned store is split when written to the sbuffer. Use
a more robust way to get the information needed for the difftest.
|
#
2159ac24 |
| 09-Dec-2024 |
Anzooooo <[email protected]> |
fix(selectOldest): use `===` instead of `isNotBefore`
For instructions with vectors or other multiple `uop`, it is necessary to determine whether `robIdx` is the same before comparing `uopIdx`. Although there is no error if `isNotBefore` is used, we can use the clearer and more concise `===` to make the determination.
|
#
b240e1c0 |
| 07-Nov-2024 |
Anzooooo <[email protected]> |
feat(Zicclsm): refactoring misalign and support vector misalign
|