US20040174814A1 - Register based remote data flow control - Google Patents

Register based remote data flow control

Info

Publication number
US20040174814A1
US20040174814A1 (application US10/759,974)
Authority
US
United States
Prior art keywords
node
data
send
host
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/759,974
Inventor
William Futral
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Priority to US10/759,974
Assigned to INTEL CORPORATION. Assignors: FUTRAL, WILLIAM T.
Publication of US20040174814A1
Status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/382 Information transfer, e.g. on bus using universal interface adapter
    • G06F13/385 Information transfer, e.g. on bus using universal interface adapter for adaptation of a particular data processing system to different peripheral devices

Abstract

In a method according to an example embodiment of the invention, a data packet is transferred from an I/O node to a host across a channel-based switching fabric interconnect. The method stores a value in a register in the I/O node which is indicative of a number of send credits available to the I/O node. The I/O node keeps a count of the number of data transfers. It is then determined from the value of the register whether or not a sufficient number of send credits is available to the I/O node for the data to be transferred by comparing it with the count of previous data transfers. If a sufficient number of send credits is available to the I/O node, it promptly transfers the data to the host over the channel-based switching fabric interconnect. If a sufficient number of send credits is not available to the I/O node, it waits for the host to update the value stored in the register before transferring data.

Description

  • This application is a continuation application of Provisional Application Serial No. 60/135,259, filed on May 21, 1999.[0001]
  • BACKGROUND
  • 1. Field of the Invention [0002]
  • The invention relates generally to methods and apparatus for data communications across a network. In particular, the invention relates to methods and apparatus for register based remote data flow control over a channel-based switching fabric interconnect or the like. [0003]
  • 2. Description of the Related Art [0004]
  • Conventionally, an Input/Output (I/O) node functioning as an intermediary between a host computer and a Local Area Network (LAN) consists of a bus master network interface card (NIC). This process is shown generally in FIG. 1. The I/O controller in the NIC is provided with specific information (i.e., descriptor or memory token) about each buffer in a list of buffers set up and maintained in the host. Every time the NIC receives one or more packets from the LAN (step 1), it reads the buffer list (step 2) and uses a Direct Memory Access (DMA) write operation across the bus to put the packet(s) into the next receive buffer(s) on the list (step 3) and send a notification of the placement of the packet(s) in the buffer(s) to the host (step 4). When the notification is received and processed, either the buffer(s) is emptied or a driver allocates a number of buffers (i.e., by placing another buffer(s) at the end of the list of buffers) in memory and sends a notification to the NIC of the information identifying each of the buffers (step 5). The I/O controller of the NIC must continually read the buffer list and manage the information about the pool of host buffers in order to be able to write data into the proper buffer as LAN packets are received. [0005]
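  • For orientation, the sketch below restates the conventional FIG. 1 flow in code form: the NIC mirrors the host's buffer list and must pair every DMA write with a separate notification. All names (host_buffer_desc, nic_state, dma_write, notify_host) are hypothetical illustrations, not structures taken from the patent.

```c
#include <stdint.h>
#include <stddef.h>

struct host_buffer_desc {               /* descriptor/memory token for one host buffer */
    uint64_t host_phys_addr;
    uint32_t length;
};

struct nic_state {
    struct host_buffer_desc list[256];  /* NIC-side copy of the host buffer list */
    unsigned head, tail;                /* next buffer to fill / next free slot  */
};

/* Provided elsewhere by the (hypothetical) NIC runtime. */
extern void dma_write(uint64_t host_addr, const void *src, size_t len);
extern void notify_host(const struct host_buffer_desc *used);

/* Steps 3 and 4 of FIG. 1: place one LAN packet and tell the host about it. */
static int deliver_packet(struct nic_state *nic, const void *pkt, size_t len)
{
    if (nic->head == nic->tail)
        return -1;                           /* buffer list exhausted: packet is dropped */
    struct host_buffer_desc *d = &nic->list[nic->head % 256];
    dma_write(d->host_phys_addr, pkt, len);  /* DMA write across the bus */
    notify_host(d);                          /* separate notification, extra bus traffic */
    nic->head++;
    return 0;
}
```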
  • This process is shown generally in FIG. 1. There are a number of disadvantages to such a process. First, a relatively high amount of overhead information must be transferred on the bus between the host computer and the NIC concerning all of the buffers being set up, emptied and allocated in the host. This information must first be transferred from the host computer to the NIC so that the NIC can maintain and refer to the buffer list when transferring data. Then, when data is transferred to a specific host buffer, that data must be accompanied by information identifying the specific host buffer. The process also increases the number of operations required to transfer data from the NIC to the host computer. In addition to the data transfer operation itself, the NIC must also send a separate notification to the host computer. Since the data transfer operation is a DMA write operation, there is typically a response sent back acknowledging the successful transfer of data. These additional operations also increase the load on the bus connecting the host computer and the NIC. [0006]
  • Such a process also leads to complexity and latencies in the I/O controller of the NIC. The I/O controller must continuously receive and store the information concerning all of the buffers being set up, emptied and allocated in the host. It must also continuously maintain and refer to the list of host buffers in order to determine the proper buffer for data to be transferred into and to attach the corresponding buffer identifying information as overhead in the RDMA write operation transferring the data to that buffer. There can be significant latencies because of the several different operations across the bus and the processing of the buffer information in the I/O controller. Also, the host may be busy processing other tasks when it gets notified of the RDMA write operation and not realize that all of the buffers are full or close to full and that additional buffers need to be posted. In the meantime, the NIC may continue to receive LAN packets. If additional buffers are not posted to the bottom of the list of buffers in time, then all of the buffers may be consumed before the host responds to the notification. In such an event, there is an overflow at the NIC and the LAN packets have to be discarded. While the host node may re-request the lost data, it causes more LAN traffic which in turn increases the latency (and decreases the performance and efficiency) of the NIC when transferring data from the LAN to the host computer. Although additional buffering may be used to offset these effects to some extent, it increases the cost of the NIC, an important consideration in the LAN environment. [0007]
  • SUMMARY
  • The present invention is directed to methods and apparatus for data communications across a network. In a method according to an example embodiment of the invention, data is transferred from an I/O node to a host across a channel-based switching fabric interconnect. The first step of the method is to store a value in a register in the I/O node which is indicative of a number of send credits available to the I/O node. It is then determined from the value of the register whether or not there is a sufficient number of send credits available to the I/O node for the data to be transferred. If a sufficient number of send credits is available to the I/O node, it promptly transfers the data to the host over the channel-based switching fabric interconnect using send/receive semantics. If a sufficient number of send credits is not available to the I/O node, it waits for the host to update the value stored in the register before transferring the data. [0008]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing and a better understanding of the present invention will become apparent from the following detailed description of example embodiments and the claims when read in connection with the accompanying drawings, all forming a part of the disclosure of the invention. While the foregoing and following written and illustrated disclosure focuses on disclosing example embodiments of the invention, it should be clearly understood that the same is by way of illustration and example only and is not to be taken by way of limitation. The spirit and scope of the present invention are set forth by the appended claims. [0009]
  • The following represents a brief description of the drawings, wherein: [0010]
  • FIG. 1 is a block and flow diagram illustrating a conventional method of transferring data from a network interface card to a host computer. [0011]
  • FIG. 2 is a block diagram illustrating the NGIO/VI Architectural model used in an example embodiment of the invention. [0012]
  • FIG. 3 is a block diagram of the host and I/O node in an example embodiment of the invention.[0013]
  • DETAILED DESCRIPTION
  • With the advent of Next Generation Input/Output (NGIO) architecture, Version 1.0, published Jul. 22, 1999, low latency and high bandwidth channel-based switching fabric interconnects between a host computer and connected devices (including I/O nodes) have become a reality. This has opened new horizons for cluster computing. When implemented in conjunction with the Virtual Interface (VI) model described in the Virtual Interface Architecture Specification, Version 1.0, Dec. 16, 1997 (jointly authored by Intel Corporation, Microsoft Corporation, and Compaq Computer Corporation), it is possible for distributed applications to perform low overhead communication using off-the-shelf NGIO hardware. However, building high-level applications using primitives provided by the VI model is complex and requires substantial development efforts because the NGIO/VI channel-based switching fabric interconnect conventionally does not provide transport level functionality such as flow control, buffer management, fragmentation and reassembly. Moreover, it is impractical to implement existing network protocols such as the Transmission Control Protocol (TCP) over NGIO/VI because this would result in unnecessary additional overhead. TCP uses a sliding window flow control protocol incorporating sequence numbers, positive acknowledgments, error and duplicate detection, timeouts and retransmission of lost packets, etc., because the underlying network is presumed to be inherently unreliable. In contrast, NGIO/VI channel-based switching fabric interconnects have very low error rates and high reliability levels (delivery and reception) and consider transport errors catastrophic for reliable data delivery mode. Thus, due to the reliable data delivery and reception of NGIO/VI channel-based switched fabric interconnects, the channel connection is broken in the rare case of a lost packet or transport error. Since the virtual interface guarantees that data is delivered exactly once in order, many of the functions performed by TCP to ensure reliability are redundant and would add unnecessary overhead. [0014]
  • Even though the host computer and I/O node (and other devices in a computing cluster) can be connected by a NGIO/VI channel-based switching fabric having low latency and high bandwidth, the effective data transfer performance across that switching fabric can be less than optimum because of the lack of flow control and buffer management. This is especially true in computing clusters having an I/O node connected to a local area network (LAN) or other bursty, asynchronous, network where the amount of network traffic can increase or decrease suddenly and/or the data transfers can vary greatly in size and type from large pre-recorded contiguous blocks of image data, such as multimedia data from a CD-ROM, to much smaller heavily fragmented user data. The LAN packets received by the I/O nodes can range in maximum size anywhere from 1500 bytes to over 64000 bytes. In such installations, the manner in which the data packets are buffered and transferred by I/O nodes, host and other elements in the computing cluster can be crucial. Therefore, a need exists for a data communication service over a channel-based switching fabric interconnect that overcomes the disadvantages of conventional PCI compliant LAN NICs discussed above with respect to FIG. 1., yet still provides flow control and buffer management for data transfer between devices of a computing cluster connected by the switching fabric. [0015]
  • The example embodiment of the present invention is applied to a host computer and I/O node of a computing cluster connected to each other over a NGIO/VI channel-based switching fabric. The host computer has a processor, associated system memory with a plurality of allocated and configured buffers, and at least one internal bus connecting these components. It uses the VI Architectural Model which will be described shortly. However, the invention may be implemented in conjunction with other different channel-based switching fabric interconnects having messaging abilities. The example embodiment and other embodiments of the invention may utilize any other architecture and channel-based interconnect which supports both message passing and remote direct memory access, such as the System I/O (SIO) architecture currently being developed as a standardization of NGIO with other architectures. In this patent application, message passing refers to the transfer of data from one end of a channel to the other end wherein the unit receiving the data determines the desired location of the transferred data in its memory. In contrast, remote direct memory access (RDMA) operations allow the initiating end of a data transfer operation to identify the memory location at the receiving end of the channel where the data will be read or stored at the completion of the data transfer operation. According to the present invention, a channel is any means of transferring data, including but not limited to virtual channels, used to transfer data between two endpoints. While the example embodiment is an NGIO implementation and this channel definition is provided in the NGIO specification identified above, the present invention is not so limited. Furthermore, the terminology used in this application is consistent with the aforementioned NGIO specification, and other architectures may use different terminology to describe similar and corresponding aspects. For example, in NGIO, the smallest possible autonomous unit of data is called a cell, and a packet is made up of a number of such cells. In contrast, SIO uses the term “packet” to describe the smallest possible autonomous unit of data instead of “cell” as in NGIO, and uses the term “message” instead of “packet”. Furthermore, an SIO packet differs slightly from the corresponding NGIO cell. An NGIO cell has a fixed header size and a fixed maximum payload of 256 bytes. An SIO packet has several headers of fixed length, but which are only conditionally present in the packet. Also, the payload of an SIO packet is a minimum of 256 bytes and the maximum payload is variable and negotiable. [0016]
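  • As a rough illustration of the cell granularity just described (a fixed maximum payload of 256 bytes per NGIO cell), the number of cells needed for one LAN packet can be computed as follows; the constant and helper names are assumptions for illustration only.

```c
#include <stdint.h>

/* Assumed constant reflecting the NGIO cell payload limit quoted above. */
#define NGIO_CELL_PAYLOAD 256u

/* Number of cells needed to carry one LAN packet of 'len' bytes. */
static inline uint32_t cells_for_packet(uint32_t len)
{
    return (len + NGIO_CELL_PAYLOAD - 1) / NGIO_CELL_PAYLOAD;  /* ceiling division */
}

/* Example: a 1500-byte Ethernet frame needs 6 cells; a 64000-byte packet needs 250. */
```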
  • For ease of comparison with the conventional method and apparatus discussed above with respect to FIG. 1, the example embodiment of the invention is directed in part to a network interface card (NIC) connected to a local area network. However, such an application is but one of several possible examples of the invention which may, of course, be applied to any I/O node or to any pair of devices where an improvement in transferring data between the devices is desired for whatever reason. An I/O node refers generally to any device or controller that connects a host device or cluster to a network fabric. Although the example embodiment of the invention is implemented and particularly well suited for data from a local area network, the invention is not so limited in its application. Other embodiments of the invention may be implemented for other networks, especially asynchronous, bursty, networks having widely varying and fluctuating data traffic which is not requested by the receiving host computer. [0017]
  • While the embodiments of the invention can be applied to any I/O technology, the traffic studied in local area networks typically has fragmented data in the first 64 bytes of transferred packets. There are inefficiencies in the fragmentation and reassembly because the data cannot be processed at the receiving end until the last cell containing an element of the data is received. Large transfers, in particular, can hold up resources for a long time since there must be a validation that the entire payload is without uncorrectable errors. (Error correction information may be included in the cells in addition to the header and payload.) [0018]
  • As shown in FIG. 2, the VI Architectural model includes a VI consumer 8 and a VI provider 24. A VI consumer 8 is a software process that communicates using a Virtual Interface (VI). The VI consumer 8 typically includes an application program 10, an operating system communications facility 12 (e.g., Sockets, Remote Procedure Call or RPC, MPI) and a VI user agent 14. The VI provider 24 includes the combination of a VI network interface controller (VI NIC) 18 and a VI kernel agent 16. It connects to the NGIO channel-based switching fabric through a channel adapter 30. [0019]
  • VI NIC 18 can directly access memory for data transfer operations with the channel-based switching fabric. There are a pair of work queues, one for send operations (a send queue 21) and one for receive operations (receive queue 19). The work queues store one or more descriptors 23 between the time each is Posted (placed in the queue) and the time it is Done (when the VI NIC has completed processing it). The descriptor 23 is a data structure recognizable by the VI NIC that describes a data movement request, and it includes a list of segments (a control segment, an optional address segment and one or more data segments). The control segment identifies the type of VI NIC data movement operation to be performed and the status of a completed NIC data movement operation. The data segment describes a communications buffer for the data transfer operations. A receive queue 19 contains descriptors that describe where to place incoming data. A send queue 21 contains descriptors that describe the data to be transmitted. A pair of VIs are associated using connection primitives (e.g., VipConnectWait, VipConnectAccept, VipConnectRequest) to allow packets sent at one VI to be received at the other VI. A send doorbell (not shown) and a receive doorbell (not shown) are provided for allowing the VI consumer to notify the VI NIC 18 that work (a descriptor describing a requested data transfer operation) has been placed in the send queue 21 and receive queue 19, respectively. [0020]
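  • A minimal sketch of the descriptor and work-queue layout described above; all type and field names (vi_descriptor, vi_work_queue, and so on) are hypothetical stand-ins rather than the formats actually defined for the VI NIC.

```c
#include <stdint.h>

/* Hypothetical rendering of descriptor 23: a control segment, an optional
 * address segment (used for RDMA), and one or more data segments. Field
 * names and sizes are illustrative, not the VI NIC's actual wire format. */
struct vi_control_segment {
    uint32_t operation;          /* type of data movement: send, receive, RDMA read/write */
    uint32_t status;             /* status of the completed operation (filled when Done)  */
};

struct vi_address_segment {
    uint64_t remote_va;          /* remote virtual address (RDMA operations only) */
    uint32_t remote_handle;      /* memory handle authorizing access at the remote end */
};

struct vi_data_segment {
    uint64_t local_va;           /* registered communications buffer */
    uint32_t length;
    uint32_t memory_handle;      /* handle returned by memory registration */
};

struct vi_descriptor {
    struct vi_control_segment ctrl;
    struct vi_address_segment addr;     /* optional */
    struct vi_data_segment data[1];     /* one or more data segments */
};

/* Each VI carries a send queue 21 and a receive queue 19 of posted descriptors. */
struct vi_work_queue {
    struct vi_descriptor *slot[64];
    unsigned posted;             /* descriptors placed on the queue */
    unsigned done;               /* descriptors the VI NIC has finished processing */
};
```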
  • The VI user agent 14 is a software component that enables an operating system communication facility 12 to utilize a particular VI provider 24. The VI user agent abstracts the details of the underlying VI NIC hardware in accordance with an interface defined by an operating system communication facility 12. The VI user agent includes a library of primitives known as the VI primitives library (VIPL) that provides functions for creating a VI (VipCreateVI), for destroying a VI (VipDestroyVI), for connecting one VI to another VI (e.g., VipConnectWait, VipConnectRequest), for accepting or rejecting a VI connection request (VipConnectAccept or VipConnectReject), for terminating or disconnecting a connection between two VIs (VipDisconnect), to allow a process to register process memory with a VI NIC (VipRegisterMem), to post descriptors (to place a descriptor in a VI work queue using, e.g., VipPostSend, VipPostRecv), etc. Details of the VI primitives (VIPL) are set forth in the VI Architecture Specification, version 1.0, Dec. 16, 1997. [0021]
  • The kernel agent 16 is the privileged part of the operating system, usually a driver supplied by the VI NIC vendor, that performs the setup and resource management functions. These functions include connection setup/teardown, interrupt management and/or processing, management of system memory used by the VI NIC and error handling. VI consumers access the kernel agent 16 using the standard operating system mechanisms such as system calls. As shown by arrow 26, the OS communication facility 12 makes system calls to the VI kernel agent 16 to perform several control operations, including to register memory. The VI architecture requires the VI consumer to register memory to be used for data transfer prior to submitting the request for data transfer. The memory regions used by descriptors and data buffers are registered prior to data transfer operations. Memory registration gives a VI NIC a method to translate a virtual address to a physical address. The user receives an opaque memory handle as a result of memory registration. This allows a user to refer to a memory region using a memory handle/virtual address pair without worrying about crossing page boundaries and keeping track of the virtual address to tag mapping. Memory registration enables the VI provider to transfer data directly between the registered buffers of a VI consumer and the channel-based switching fabric. [0022]
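  • To make the registration and translation step concrete, a minimal sketch follows, assuming a flat translation table; the structure and function names are hypothetical, and the contiguous-region mapping ignores the page-boundary handling a real VI NIC performs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of memory registration: the consumer registers a region
 * and receives an opaque handle; the VI NIC later translates a
 * (handle, virtual address) pair into a physical address. */
struct mem_region {
    uint32_t handle;             /* opaque handle returned to the VI consumer */
    uint64_t base_va;            /* start of the registered virtual range */
    size_t   length;
    uint64_t base_pa;            /* physical address pinned at registration time */
};

#define MAX_REGIONS 32
static struct mem_region translation_table[MAX_REGIONS];
static unsigned region_count;

/* Registration: record the mapping and hand back a handle (0 = failure here). */
static uint32_t register_memory(uint64_t va, size_t len, uint64_t pa)
{
    if (region_count >= MAX_REGIONS)
        return 0;
    struct mem_region *r = &translation_table[region_count];
    r->handle  = 0x1000 + region_count;   /* arbitrary opaque value */
    r->base_va = va;
    r->length  = len;
    r->base_pa = pa;
    region_count++;
    return r->handle;
}

/* Translation performed by the VI NIC for a posted descriptor. Returns 0 if
 * the handle/virtual-address pair does not fall in a registered region. */
static uint64_t translate(uint32_t handle, uint64_t va)
{
    for (unsigned i = 0; i < region_count; i++) {
        struct mem_region *r = &translation_table[i];
        if (r->handle == handle && va >= r->base_va && va < r->base_va + r->length)
            return r->base_pa + (va - r->base_va);
    }
    return 0;                    /* not registered: the operation is rejected */
}
```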
  • After registering memory, operating system communication facility 12 can use data transfer primitives of the VIPL library of VI user agent 14 to send and receive data. The VI Architecture defines two types of data transfer operations: 1) send/receive message passing, and 2) RDMA read/write operations. Once a connection is established, the operating system facility 12 posts the application's send and receive requests directly to the send and receive queues. The descriptors are posted (e.g., placed in a work queue) and then a doorbell is rung to notify the NIC that work has been placed in the work queue. The doorbell can be rung (and the VI NIC 18 notified of the work in the queue) without kernel processing. The VI NIC 18 then processes the descriptor by sending or receiving data, and then notifies the VI User Agent 14 of the completed work using the completion queue 22. The VI NIC 18 directly performs the data transfer functions in response to the posted descriptors. [0023]
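  • The posting, doorbell and completion sequence just described might look roughly as follows. The doorbell and completion-queue calls are assumed placeholders (in practice the consumer would go through VIPL functions such as VipPostSend), and the types repeat the earlier hypothetical sketch so this block stands alone.

```c
#include <stdint.h>

struct vi_descriptor;            /* layout sketched earlier; opaque here */

struct vi_work_queue {
    struct vi_descriptor *slot[64];
    unsigned posted, done;
};

extern void ring_doorbell(struct vi_work_queue *q);           /* assumed NIC register write */
extern int  completion_queue_poll(struct vi_descriptor **d);  /* assumed completion check   */

/* Post a send descriptor: place it on the send queue, then ring the doorbell
 * so the VI NIC starts processing without any kernel call. */
static void post_send(struct vi_work_queue *sendq, struct vi_descriptor *desc)
{
    sendq->slot[sendq->posted % 64] = desc;
    sendq->posted++;
    ring_doorbell(sendq);
}

/* Reap completed work: the VI NIC reports finished descriptors through the
 * completion queue, and the consumer marks them Done. */
static void reap_completions(struct vi_work_queue *sendq)
{
    struct vi_descriptor *done_desc;
    while (completion_queue_poll(&done_desc))
        sendq->done++;
}
```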
  • The NGIO/VI Architecture supports an unacknowledged class of service at the NIC level. However, it does not perform other transport level functions, including flow control and buffer management. The VI Architecture Specification, version 1.0, Dec. 16, 1997 states at page 15 that “VI consumers are responsible for managing flow control on a connection.” The present invention is designed to provide data flow control over the NGIO/VI architecture or similar architecture. [0024]
  • An example embodiment of the invention is illustrated by the block diagram in FIG. 3. Host computer 300 has a device driver 301 configured according to the VI model described above, a host processor (CPU) 302 controlling operation of host computer 300 and a system memory 303 coupled to host processor 302 via a host bus. The device driver 301 is coupled to host memory 303 and to host processor 302. It has send/receive descriptors and information for credit based flow control. A host channel adapter (HCA) 304 connects host computer 300 to NGIO switching fabric 305. A portion of system memory 303 is allocated for a plurality of send buffers 303-1 and receive buffers 303-2, which are registered with device driver 301. Once the buffers are registered, device driver 301 can transfer incoming data directly from HCA 304 to a receive buffer 303-2, and outgoing data can be directly transferred from a send buffer 303-1 to HCA 304. Pools of associated send and receive descriptors are also created and registered in device driver 301. [0025]
  • The switching fabric may contain many different switches SW and redundant paths (not shown) throughout the fabric, such that a plurality of messages can be traveling through the switching fabric at any given time. The switched fabric configuration can contain a plurality of channel adapters, such that there can be a multitude of different messages traveling through the fabric and where all of the various connected devices can continue operating while their messages are traveling through the switching fabric. [0026]
  • I/O node 310 is connected to NGIO switching fabric 305 through target channel adapter (TCA) 311 and to LAN 320 through a conventional LAN receive engine 313. I/O node 310 includes an I/O controller 312 configured according to the VI model described above. According to a feature of the invention, I/O controller 312 includes a credit register 314 storing credits indicating the number of receive buffers available in host computer 300. Device driver 301 is responsible for managing data flow between host computer 300 and I/O node 310 over a channel in NGIO switching fabric 305. [0027]
  • The data transfers are optimized through the device driver 301 and I/O controller 312 at all times. This helps keep the processor or other elements of the host computer 300 or I/O node 310 from having to expend system resources to accomplish transfer of data blocks since there may be access conflicts with other functions. This method results in an immediate advantage compared to the conventional method shown in FIG. 1, which must have several operations across the fabric, plus a direct memory access (step 2) to place the data in a receive buffer 303-2 of host 300. The host channel adapter 304 and target channel adapter 311 provide all of the scatter/gather capability in the NGIO hardware so that the data is immediately delivered to the target as one contiguous block of data when possible. This minimizes the number of NGIO operations and transaction latency while improving the efficiency of data transfers. If necessary, data can be transferred in one or more data packets. In such an event, the individual data packets are successively transferred according to the same register based flow control scheme as intact data. [0028]
  • Before a connection is started between host 300 and I/O node 310, a memory token is transferred to device driver 301. The memory token provides host 300 with access to credit register 314. The memory token can be of any format, e.g., simply a series of bits indicating the address of the remaining left-over data in memory of host computer 300. In the example embodiment, the memory token consists of a virtual address and a memory handle. The virtual address is determined by the I/O controller 312 and, when received as part of an RDMA read operation, it is converted by a translation table into a physical address in memory corresponding to credit register 314. The I/O controller 312 may require that the memory handle accompanying the RDMA read operation is the same as that provided by it to ensure that the initiator of the RDMA read operation is entitled to access the data. In advanced memory handle techniques, the memory handle may also indicate the authorization of the RDMA read operation to access the credit register 314. An important advantage of the example embodiment is that only credit register 314 and a single memory token need to be provided rather than the entire buffer list in the conventional system in FIG. 1. [0029]
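  • A minimal sketch, assuming a single-entry token check, of how the I/O controller might gate RDMA access to credit register 314 using the memory token described above; the token values and the validation routine are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout of the memory token handed to device driver 301 before
 * the connection starts: a virtual address plus a memory handle granting
 * access to credit register 314. */
struct memory_token {
    uint64_t virtual_addr;       /* chosen by I/O controller 312 */
    uint32_t memory_handle;      /* must accompany the RDMA operation */
};

/* State kept on the I/O node side (values are arbitrary placeholders). */
static struct memory_token issued_token = { 0xF0000000ull, 0xC0FFEEu };
static volatile uint32_t credit_register;   /* credit register 314 */

/* Check an incoming RDMA access to the credit register: the virtual address
 * must map to the register and the handle must match the one issued. */
static volatile uint32_t *validate_credit_access(uint64_t va, uint32_t handle)
{
    if (va != issued_token.virtual_addr)
        return NULL;                         /* does not map to credit register 314 */
    if (handle != issued_token.memory_handle)
        return NULL;                         /* initiator is not authorized          */
    return &credit_register;                 /* translated target of the operation   */
}
```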
  • After the initial RDMA write operation to initialize credit register 314, device driver 301 will initiate multiple subsequent RDMA write operations as necessary to send credits to indicate that receive buffers 303-2 have been emptied or replenished, or that additional receive buffers 303-2 have been allocated in memory 303 of host 300. This process is indicated by step (1) in FIG. 3. A key feature of this example embodiment is that I/O controller 312 does not have to manage a buffer list or buffer information. Indeed, the send credits in credit register 314 are updated by host 300 without any participation by I/O controller 312. CPU 302 schedules the buffer setup, and corresponding RDMA write operations, at a rate consistent with the resources on host 300. In particular, it schedules the buffer operation, and corresponding RDMA write operations, at the rate that it and I/O node 310 can best consume them, thus increasing efficiency without additional demand on I/O node 310. A key advantage of this example embodiment is the efficiency with which the host 300 can use its resources. Host computers, especially servers, typically have many gigabytes of memory and a large amount of data that is being transferred to and from a network. But the amount of memory on an I/O node is relatively small in comparison. The granularity of cells passed back and forth in the NGIO switching fabric allows the example embodiment to optimize the use of receive buffers in the host 300. [0030]
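  • A sketch of the host-side replenishment step (1), under the assumption that credit register 314 holds a cumulative count of granted credits; the rdma_write primitive and the credit_grantor bookkeeping are hypothetical.

```c
#include <stdint.h>

/* Host-side sketch: device driver 301 tracks how many send credits it has
 * granted and pushes the running total into credit register 314 with an RDMA
 * write whenever receive buffers 303-2 are emptied, replenished, or newly
 * allocated. */
extern void rdma_write(uint64_t remote_va, uint32_t remote_handle,
                       const void *local, uint32_t len);    /* assumed primitive */

struct credit_grantor {
    uint64_t token_va;           /* from the memory token received at setup */
    uint32_t token_handle;
    uint32_t credits_granted;    /* cumulative count of receive buffers posted */
};

/* Call whenever 'n' receive buffers become available to the I/O node. */
static void grant_credits(struct credit_grantor *g, uint32_t n)
{
    g->credits_granted += n;     /* host updates the count at its own pace */
    rdma_write(g->token_va, g->token_handle,
               &g->credits_granted, sizeof(g->credits_granted));
}
```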
  • There may be a series of data transfer operations sending LAN data to host computer 300 from I/O controller 312 according to the example embodiment of the invention as shown by step (2) in FIG. 3. I/O controller 312 counts the number of transfers in counter 316. As mentioned before, the I/O controller 312 can immediately transfer data whenever credit register 314 is greater than counter 316. Even though the data arrives asynchronously and unexpectedly from the LAN, it can be promptly forwarded to host 300 since there is no need for complicated processing. Conversely, the I/O controller 312 stops transferring data when host 300 consumes all of the registered receive buffers 303-2. As a result, data flow control can be simply and remotely established by host 300. [0031]
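  • The gating test just described reduces to a single comparison, sketched below; the send_to_host helper and the choice of cumulative counts are assumptions consistent with the counter 316 and credit register 314 description.

```c
#include <stdint.h>
#include <stdbool.h>

/* I/O-node-side sketch of step (2): data may be sent only while the value
 * RDMA-written into credit register 314 exceeds the transfer count kept in
 * counter 316. send_to_host stands in for the NGIO send-message operation. */
extern void send_to_host(const void *pkt, uint32_t len, uint32_t seq);  /* assumed */

static volatile uint32_t credit_register;   /* credit register 314, written by host 300 */
static uint32_t transfer_counter;           /* counter 316, incremented per send        */

/* Returns true if the packet was forwarded, false if it must wait until the
 * host grants more credits (or, if LAN-side buffering fills, be dropped).    */
static bool try_forward(const void *pkt, uint32_t len)
{
    if (credit_register <= transfer_counter)
        return false;                         /* no receive buffer guaranteed at host */
    transfer_counter++;                       /* consume one credit                    */
    send_to_host(pkt, len, transfer_counter); /* counter value travels in the message  */
    return true;
}
```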
  • The data flow control also allows corrupted packets to be silently dropped between I/O node 310 and host 300. Although not shown, there is a counter 316 in the I/O node 310 that is incremented every time a LAN packet is sent to the host. The value of that counter is placed inside the data packet that is sent from I/O node 310 to host 300. If a LAN packet is received where the counter is equal to credit register 314, the packet is discarded and not sent. Placing the counter value in the send message itself allows the host to detect when it has missed packets. This increases efficiency and compensates for the receive buffer 303-2 that did not get consumed by the defective packet. Every time host 300 detects a gap in count values in send messages, it also updates the credit register 314, increasing the number of send credits so that the buffer which was not filled can be used for another data packet. [0032]
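  • A sketch of the gap-detection bookkeeping on the host, assuming the counter value arrives in every send message; the helper functions are hypothetical placeholders for the credit update and the normal receive path.

```c
#include <stdint.h>

/* Host-side sketch of the silent-drop handling described above: each send
 * message carries the I/O node's counter value, so a gap in the sequence
 * tells host 300 that a packet was dropped and that the credit it consumed
 * should be returned. */
struct drop_tracker {
    uint32_t expected_count;     /* next counter value the host expects */
};

extern void grant_credits_back(uint32_t n);  /* assumed: re-runs the RDMA credit update */
extern void deliver_to_stack(const void *payload, uint32_t len);

static void on_send_message(struct drop_tracker *t, uint32_t count,
                            const void *payload, uint32_t len)
{
    if (count != t->expected_count) {
        uint32_t missed = count - t->expected_count;  /* packets silently dropped      */
        grant_credits_back(missed);                   /* free the unconsumed buffers   */
    }
    t->expected_count = count + 1;
    deliver_to_stack(payload, len);
}
```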
  • This leads to another important feature of the example embodiment. The data flow control is set up by host channel adapter 304 and target channel adapter 311 over a channel with an unacknowledged class of service. If data packets were corrupted on an acknowledged channel, the data transfers would have to be stopped and restarted to compensate. If all of the data packets are held during that period of time, there would be a tremendous buffer requirement. [0033]
  • The example embodiment uses the messaging ability of the NGIO/VI architecture to send simple send messages over the unacknowledged channel. Consequently, I/O node 310 does not have to know anything about host 300 or the memory address location destination of the data. Instead of doing a RDMA write across the channel-based switching fabric, it uses a send message to a particular queue pair set up by the virtual interface in device driver 301 that has receive buffers 303-2 associated with corresponding receive queues. So as the data comes across the channel-based switching fabric, it goes into those buffers automatically. The only element that needs to know the particulars of the process is the virtual interface in device driver 301 (not even the host channel adapter 304 and target channel adapter 311 need them), although the data transfer is fairly simple. [0034]
• [0035] Although FIG. 3 illustrates an example embodiment, the invention is not limited to that embodiment. Indeed, an advantage of the invention is that it is particularly useful and widely adaptable to any I/O device having latency in data transfer operations. In this way, data transfers can be efficient in a server that has a great many network I/O interfaces as well as other interfaces. The example embodiments automatically adapt to transfer characteristics in which large blocks of data, as well as small blocks, are generally transferred asynchronously. Indeed, the example embodiments will adapt to any I/O data interface.
• [0036] Other features of the invention may be apparent to those skilled in the art from the detailed description of the example embodiments and claims when read in connection with the accompanying drawings. While the foregoing and following written and illustrated disclosure focuses on disclosing example embodiments of the invention, it should be understood that the same is by way of illustration and example only, is not to be taken by way of limitation, and may be modified in learned practice of the invention. While the foregoing has described what are considered to be example embodiments of the invention, it is understood that various modifications may be made therein, that the invention may be implemented in various forms and embodiments, and that it may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim all such modifications and variations.

Claims (20)

1. A method of transferring data from an I/O node to a host across a channel-based switching fabric interconnect, the method comprising:
storing a value in a register in the I/O node which is indicative of a number of send credits available to the I/O node;
determining from the value of the register if there is a sufficient number of send credits available to the I/O node for the data to be transferred;
promptly transferring the data from the I/O node to the host over the channel-based switching fabric interconnect if a sufficient number of send credits is available to the I/O node; and
otherwise, if a sufficient number of send credits is not available to the I/O node, waiting for the host to update the value stored in the register before transferring data.
2. The method of claim 1, wherein the data is transferred by a send message sent from the I/O node to the host over the channel-based switching fabric when sufficient send credits are available for the data.
3. The method of claim 1, wherein each of the send credits available to the I/O node represents one or more receive buffers which are available at the host for receiving and storing a data packet.
4. The method of claim 2, wherein said step of transferring data by sending a send message comprises:
preparing a descriptor describing the send operation to be performed; and
posting the descriptor to one of a plurality of work queues in the I/O node.
5. The method of claim 4, wherein the I/O node further comprises a plurality of send buffers storing the data to be transferred, the step of transferring data by sending a send message comprises:
preparing a descriptor describing the send operation to be performed;
posting the send descriptor to one of the work queues in the I/O node; and
processing the posted send descriptor by transferring the data from one of the send buffers to the channel-based switching fabric interconnect.
6. The method of claim 4, wherein the host updates the value stored in the register by performing an RDMA write operation to the register.
7. The method of claim 5, further comprising:
fragmenting the data to be transferred into two or more data packets;
performing the following steps until all data packets have been sent:
a) determining if a sufficient number of send credits is available at the I/O node;
b) sending a data packet from the I/O node over the channel-based switching fabric if a sufficient number of send credits is available, and adjusting the number of send credits based on the sending of the data packet; and
c) otherwise, if a sufficient number of send credits is not available at the I/O node, waiting for the host to update the value stored in the register before sending a data packet.
8. An I/O node configured to communicate with a host across a channel-based switching fabric interconnect, the I/O node comprising:
a channel adapter connecting the I/O node to the channel-based switching fabric; and
a virtual interface, including:
a plurality of send and receive buffers;
a transport service layer, the transport service layer transferring data between the I/O node and the host;
an interface user agent coupled to the transport service layer;
a kernel agent;
a plurality of work queues; and
a network interface controller coupled to the kernel agent, the work queues and the channel adapter;
said virtual interface to issue one or more control commands to the kernel agent to establish a connection between the I/O node and the host across the channel-based switching fabric and to post data transfer requests to the work queues in response to commands from the transport service layer; and
the network interface controller to process the data transfer requests by transferring data between the send and receive buffers and the channel adapter.
9. The I/O node of claim 8 wherein the virtual interface is in accordance with at least a portion of the Virtual Interface (VI) Architecture.
10. The I/O node of claim 9, wherein the kernel agent comprises a Virtual Interface (VI) kernel agent, and the network interface controller comprises a Virtual Interface (VI) network interface controller.
11. An I/O node configured to communicate with a host across a channel-based switching fabric, said I/O node comprising:
a memory including send and receive application buffers;
a transport service layer providing for data transfer across the channel-based switching fabric;
a network interface controller coupled to the channel-based switching fabric;
a plurality of work queues coupled to the network interface controller for posting data transfer requests thereto;
a user agent coupled to the send and receive buffers and the network interface controller, the user agent posting data transfer requests to the work queues, the network interface controller processing the posted data transfer requests by transferring data between the send and receive buffers and the channel-based switching fabric.
12. An I/O node configured to communicate with a host computer over a channel-based switching fabric interconnect, the I/O node comprising:
a processor;
a register storing send credits;
one or more work queues for posting data transfer requests;
one or more registered send buffers;
one or more registered receive buffers;
a network interface controller coupled to the processor, the work queues, the buffers and the channel-based switching fabric, the network interface controller processing the posted data transfer requests by transferring data between the registered buffers and the channel-based switching fabric interconnect; and
the processor being programmed to control the transfer of data through the network interface controller according to a credit-based flow control scheme depending on the send credits stored in said register.
13. The I/O node of claim 12, wherein said processor is programmed to perform the following:
determine if a sufficient number of send credits is available;
send a send message containing a data packet from the I/O node to the host over the channel-based switching fabric interconnect if a sufficient number of send credits are available; and
otherwise, if a sufficient number of send credits is not available, wait for the host to update the value stored in the register before transferring the data packet.
14. The I/O node of claim 13, wherein the I/O node places a count value of the number of data transfers in the send message.
15. The I/O node of claim 13, wherein the host updates the value stored in the register by performing an RDMA write operation to the register of the I/O node.
16. A host computer comprising:
a network interface controller adapter connecting the host to a host channel adapter on a channel-based switching fabric interconnect;
a host processor;
a memory having registered send and receive buffers; and
a device driver coupled to the host processor and the memory, and having one or more work queues for posting data transfer requests and a transport service layer providing an end-to-end credit-based flow control across the channel-based switching fabric interconnect according to the status of said registered receive buffers.
17. The host computer recited in claim 16, wherein the device driver comprises a virtual interface, including:
a transport service layer, the transport service layer transferring data between the I/O node and the host;
an interface user agent coupled to the transport service layer; and
a kernel agent coupled to the interface user agent and the work queues,
said virtual interface issuing one or more control commands to the kernel agent to establish a connection between the I/O node and the host across the channel-based switching fabric and posting data transfer requests to the work queues in response to commands from the transport service layer.
18. The host computer of claim 17, wherein the virtual interface is in accordance with at least a portion of the Virtual Interface (VI) Architecture.
19. The host computer of claim 18, wherein the kernel agent comprises a Virtual Interface (VI) kernel agent, and the network interface controller comprises a Virtual Interface (VI) network interface controller.
20. The host computer of claim 16, wherein the device driver allocates receive buffers in the memory and performs an RDMA write operation to the I/O node to update a register storing the send credits of said I/O node.
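Purely as an illustrative aid, and forming no part of the claims, the credit-gated fragmentation loop recited in claims 1 and 7 might look like the following C sketch. The MTU value, the helper names (credits_available, wait_for_credits, send_packet, send_all), and the blocking wait are assumptions made for the sketch only.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MTU 64

    static uint32_t credit_register;     /* updated remotely by the host */
    static uint32_t sent_count;          /* packets sent so far          */

    static int credits_available(void) { return credit_register > sent_count; }

    /* Stand-in for waiting on a host RDMA write to the credit register. */
    static void wait_for_credits(void) { credit_register += 4; }

    /* Stand-in for posting one send message to a work queue. */
    static void send_packet(const char *data, size_t len)
    {
        sent_count++;
        printf("sent packet %u (%zu bytes)\n", sent_count, len);
        (void)data;
    }

    /* Fragment len bytes into MTU-sized packets, sending each one only when a
     * send credit is available, otherwise waiting for the host to grant more. */
    static void send_all(const char *data, size_t len)
    {
        for (size_t off = 0; off < len; off += MTU) {
            while (!credits_available())
                wait_for_credits();
            size_t chunk = (len - off < MTU) ? len - off : MTU;
            send_packet(data + off, chunk);
        }
    }

    int main(void)
    {
        char payload[200];
        memset(payload, 'x', sizeof payload);
        credit_register = 2;             /* initial grant from the host */
        send_all(payload, sizeof payload);
        return 0;
    }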
US10/759,974 1999-05-21 2004-01-15 Register based remote data flow control Abandoned US20040174814A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/759,974 US20040174814A1 (en) 1999-05-21 2004-01-15 Register based remote data flow control

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13525999P 1999-05-21 1999-05-21
US09/461,236 US6747949B1 (en) 1999-05-21 1999-12-16 Register based remote data flow control
US10/759,974 US20040174814A1 (en) 1999-05-21 2004-01-15 Register based remote data flow control

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/461,236 Continuation US6747949B1 (en) 1999-05-21 1999-12-16 Register based remote data flow control

Publications (1)

Publication Number Publication Date
US20040174814A1 true US20040174814A1 (en) 2004-09-09

Family

ID=32328612

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/461,236 Expired - Lifetime US6747949B1 (en) 1999-05-21 1999-12-16 Register based remote data flow control
US10/759,974 Abandoned US20040174814A1 (en) 1999-05-21 2004-01-15 Register based remote data flow control

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/461,236 Expired - Lifetime US6747949B1 (en) 1999-05-21 1999-12-16 Register based remote data flow control

Country Status (1)

Country Link
US (2) US6747949B1 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6944173B1 (en) * 2000-03-27 2005-09-13 Hewlett-Packard Development Company, L.P. Method and system for transmitting data between a receiver and a transmitter
US7190667B2 (en) * 2001-04-26 2007-03-13 Intel Corporation Link level packet flow control mechanism
US20030065735A1 (en) * 2001-10-02 2003-04-03 Connor Patrick L. Method and apparatus for transferring packets via a network
US20030084219A1 (en) * 2001-10-26 2003-05-01 Maxxan Systems, Inc. System, apparatus and method for address forwarding for a computer network
US20030101158A1 (en) * 2001-11-28 2003-05-29 Pinto Oscar P. Mechanism for managing incoming data messages in a cluster
US7145914B2 (en) 2001-12-31 2006-12-05 Maxxan Systems, Incorporated System and method for controlling data paths of a network processor subsystem
US7085846B2 (en) * 2001-12-31 2006-08-01 Maxxan Systems, Incorporated Buffer to buffer credit flow control for computer network
US7295561B1 (en) 2002-04-05 2007-11-13 Ciphermax, Inc. Fibre channel implementation using network processors
US7307995B1 (en) 2002-04-05 2007-12-11 Ciphermax, Inc. System and method for linking a plurality of network switches
US7406038B1 (en) 2002-04-05 2008-07-29 Ciphermax, Incorporated System and method for expansion of computer network switching system without disruption thereof
US7379970B1 (en) 2002-04-05 2008-05-27 Ciphermax, Inc. Method and system for reduced distributed event handling in a network environment
US20030195956A1 (en) * 2002-04-15 2003-10-16 Maxxan Systems, Inc. System and method for allocating unique zone membership
US20030200330A1 (en) * 2002-04-22 2003-10-23 Maxxan Systems, Inc. System and method for load-sharing computer network switch
US7327674B2 (en) * 2002-06-11 2008-02-05 Sun Microsystems, Inc. Prefetching techniques for network interfaces
US6986017B2 (en) * 2003-04-24 2006-01-10 International Business Machines Corporation Buffer pre-registration
US20050091334A1 (en) * 2003-09-29 2005-04-28 Weiyi Chen System and method for high performance message passing
US7912979B2 (en) * 2003-12-11 2011-03-22 International Business Machines Corporation In-order delivery of plurality of RDMA messages
US8009563B2 (en) * 2003-12-19 2011-08-30 Broadcom Corporation Method and system for transmit scheduling for multi-layer network interface controller (NIC) operation
US7349978B2 (en) * 2004-01-15 2008-03-25 Microsoft Corporation Spurious timeout detection in TCP based networks
US6973821B2 (en) * 2004-02-19 2005-12-13 Caterpillar Inc. Compaction quality assurance based upon quantifying compactor interaction with base material
US7441055B2 (en) * 2004-03-31 2008-10-21 Intel Corporation Apparatus and method to maximize buffer utilization in an I/O controller
US7644221B1 (en) * 2005-04-11 2010-01-05 Sun Microsystems, Inc. System interface unit
US7698477B2 (en) * 2005-11-30 2010-04-13 Lsi Corporation Method and apparatus for managing flow control in PCI express transaction layer
US7945719B2 (en) * 2006-09-20 2011-05-17 Intel Corporation Controller link for manageability engine
GB2465595B (en) * 2008-11-21 2010-12-08 Nokia Corp A method and an apparatus for a gateway
US9405725B2 (en) 2011-09-29 2016-08-02 Intel Corporation Writing message to controller memory space
US9313274B2 (en) * 2013-09-05 2016-04-12 Google Inc. Isolating clients of distributed storage systems

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4475433A (en) * 1980-06-06 1984-10-09 Muse Music Company, Limited Capo for a stringed musical instrument
US5208810A (en) * 1990-10-10 1993-05-04 Seiko Corp. Method of data flow control
US5550957A (en) * 1994-12-07 1996-08-27 Lexmark International, Inc. Multiple virtual printer network interface
US5633870A (en) * 1995-07-07 1997-05-27 Sun Microsystems, Inc. Method and apparatus for controlling data flow through an ATM interface
US5777624A (en) * 1996-01-02 1998-07-07 Intel Corporation Method and apparatus for eliminating visual artifacts caused by diffusing errors in a decimated video signal
US5825748A (en) * 1997-04-08 1998-10-20 International Business Machines Corporation Credit-based flow control checking and correction system
US5937436A (en) * 1996-07-01 1999-08-10 Sun Microsystems, Inc Network interface circuit including an address translation unit and flush control circuit and method for checking for invalid address translations
US6112263A (en) * 1997-12-15 2000-08-29 Intel Corporation Method for multiple independent processes controlling access to I/O devices in a computer system
US6125433A (en) * 1990-06-26 2000-09-26 Lsi Logic Corporation Method of accomplishing a least-recently-used replacement scheme using ripple counters
US6185620B1 (en) * 1998-04-03 2001-02-06 Lsi Logic Corporation Single chip protocol engine and data formatter apparatus for off chip host memory to local memory transfer and conversion
US6549540B1 (en) * 1999-03-15 2003-04-15 Sun Microsystems, Inc. Method and apparatus for bundling serial data transmission links to obtain increased data throughput
US6594701B1 (en) * 1998-08-04 2003-07-15 Microsoft Corporation Credit-based methods and systems for controlling data flow between a sender and a receiver with reduced copying of data
US6618354B1 (en) * 1998-03-13 2003-09-09 Hewlett-Packard Development Company, L.P. Credit initialization in systems with proactive flow control

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR970011859B1 1993-04-15 1997-07-18 Samsung Electronics Co., Ltd. Encoding method and device for using fuzzy control
US5633867A (en) * 1994-07-01 1997-05-27 Digital Equipment Corporation Local memory buffers management for an ATM adapter implementing credit based flow control
US5515359A (en) * 1994-08-26 1996-05-07 Mitsubishi Electric Research Laboratories, Inc. Credit enhanced proportional rate control system
US5528591A (en) * 1995-01-31 1996-06-18 Mitsubishi Electric Research Laboratories, Inc. End-to-end credit-based flow control system in a digital communication network
FR2759518B1 (en) * 1997-02-07 1999-04-23 France Telecom METHOD AND DEVICE FOR ALLOCATING RESOURCES IN A DIGITAL PACKET TRANSMISSION NETWORK
US6347337B1 (en) * 1999-01-08 2002-02-12 Intel Corporation Credit based flow control scheme over virtual interface architecture for system area networks

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030026267A1 (en) * 2001-07-31 2003-02-06 Oberman Stuart F. Virtual channels in a network switch
US20030037268A1 (en) * 2001-08-16 2003-02-20 International Business Machines Corporation Power conservation in a server cluster
US6993571B2 (en) * 2001-08-16 2006-01-31 International Business Machines Corporation Power conservation in a server cluster
US7911952B1 (en) * 2002-07-12 2011-03-22 Mips Technologies, Inc. Interface with credit-based flow control and sustained bus signals
US20050015459A1 (en) * 2003-07-18 2005-01-20 Abhijeet Gole System and method for establishing a peer connection using reliable RDMA primitives
US20050015460A1 (en) * 2003-07-18 2005-01-20 Abhijeet Gole System and method for reliable peer communication in a clustered storage system
US7593996B2 (en) * 2003-07-18 2009-09-22 Netapp, Inc. System and method for establishing a peer connection using reliable RDMA primitives
US7716323B2 (en) 2003-07-18 2010-05-11 Netapp, Inc. System and method for reliable peer communication in a clustered storage system
US20050080928A1 (en) * 2003-10-09 2005-04-14 Intel Corporation Method, system, and program for managing memory for data transmission through a network
US7496690B2 (en) * 2003-10-09 2009-02-24 Intel Corporation Method, system, and program for managing memory for data transmission through a network
US20100061392A1 (en) * 2006-04-27 2010-03-11 Ofer Iny Method, device and system of scheduling data transport over a fabric
US7619970B2 (en) * 2006-04-27 2009-11-17 Dune Semiconductor Ltd. Method, device and system of scheduling data transport over a fabric
US20070253439A1 (en) * 2006-04-27 2007-11-01 Ofer Iny Method, device and system of scheduling data transport over a fabric
US7990858B2 (en) * 2006-04-27 2011-08-02 Dune Networks, Inc. Method, device and system of scheduling data transport over a fabric
US8683000B1 (en) * 2006-10-27 2014-03-25 Hewlett-Packard Development Company, L.P. Virtual network interface system with memory management
US8688798B1 (en) 2009-04-03 2014-04-01 Netapp, Inc. System and method for a shared write address protocol over a remote direct memory access connection
US9544243B2 (en) 2009-04-03 2017-01-10 Netapp, Inc. System and method for a shared write address protocol over a remote direct memory access connection
US9131011B1 (en) * 2011-08-04 2015-09-08 Wyse Technology L.L.C. Method and apparatus for communication via fixed-format packet frame
US9225809B1 (en) 2011-08-04 2015-12-29 Wyse Technology L.L.C. Client-server communication via port forward
US9232015B1 (en) 2011-08-04 2016-01-05 Wyse Technology L.L.C. Translation layer for client-server communication
US10241922B2 (en) 2015-12-17 2019-03-26 Samsung Electronics Co., Ltd. Processor and method

Also Published As

Publication number Publication date
US6747949B1 (en) 2004-06-08

Similar Documents

Publication Publication Date Title
US6747949B1 (en) Register based remote data flow control
US6594701B1 (en) Credit-based methods and systems for controlling data flow between a sender and a receiver with reduced copying of data
US7103888B1 (en) Split model driver using a push-push messaging protocol over a channel based network
US7281030B1 (en) Method of reading a remote memory
US7519650B2 (en) Split socket send queue apparatus and method with efficient queue flow control, retransmission and sack support mechanisms
EP1142215B1 (en) A credit based flow control method
US7912988B2 (en) Receive queue device with efficient queue flow control, segment placement and virtualization mechanisms
US6615282B1 (en) Adaptive messaging
Buonadonna et al. An implementation and analysis of the virtual interface architecture
US8769036B2 (en) Direct sending and asynchronous transmission for RDMA software implementations
CA2509404C (en) Using direct memory access for performing database operations between two or more machines
CN113711551A (en) System and method for facilitating dynamic command management in a Network Interface Controller (NIC)
US6888792B2 (en) Technique to provide automatic failover for channel-based communications
US6742051B1 (en) Kernel interface
US20070220183A1 (en) Receive Queue Descriptor Pool
US20130290558A1 (en) Data transfer, synchronising applications, and low latency networks
GB2339517A (en) Message transmission between network nodes connected by parallel links
US7457845B2 (en) Method and system for TCP/IP using generic buffers for non-posting TCP applications
US6898638B2 (en) Method and apparatus for grouping data for transfer according to recipient buffer size
EP1302855A2 (en) Method of sending a request
US7209489B1 (en) Arrangement in a channel adapter for servicing work notifications based on link layer virtual lane processing
Ahuja et al. Design, implementation, and performance measurement of a native-mode ATM transport layer (extended version)
KR100412010B1 (en) Flow architecture for remote high-speed interface application
US20060004933A1 (en) Network interface controller signaling of connection event
US7292593B1 (en) Arrangement in a channel adapter for segregating transmit packet data in transmit buffers based on respective virtual lanes

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:FUTRAL, WILLIAM T.;REEL/FRAME:014905/0902

Effective date: 19991215

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION