CN100486178C - A remote internal memory sharing system and its realization method

A remote internal memory sharing system and its realization method

Info

Publication number
CN100486178C
CN100486178C CNB2006101648500A CN200610164850A
Authority
CN
China
Prior art keywords
block
memory
node
remote
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006101648500A
Other languages
Chinese (zh)
Other versions
CN1972215A (en)
Inventor
王楠
刘旭辉
韩冀中
贺劲
章立生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB2006101648500A
Publication of CN1972215A
Application granted
Publication of CN100486178C
Expired - Fee Related
Anticipated expiration

Abstract

This invention discloses a remote memory sharing system comprising a manager and compute nodes, wherein a memory pool driver module that supplies and manages physical memory blocks for the remote memory sharing system is installed on the compute nodes, and the compute nodes carrying the driver module are connected to the network; the compute nodes and the memory server are connected to the manager through the network; the compute nodes further comprise a mapper and a swapper. This invention also discloses a method for implementing remote memory sharing.

Description

A remote memory sharing system and an implementation method thereof
Technical field
The present invention relates to memory sharing in computers, and in particular to a system and method for implementing remote memory sharing.
Background art
With the continuous development of network technology, network speeds keep improving, and new technologies such as InfiniBand and 10G Ethernet have appeared in recent years. These technologies have created favorable conditions for the application of cluster systems, which are now widely used in parallel computing, Web services, scientific computing, databases and other fields.
A cluster system can provide users with a large number of CPUs and a large amount of memory, but because each node in the cluster is still an autonomous individual, its memory resources cannot be used effectively, and the memory resources of the cluster are seriously wasted. In desktop workloads, as the memory configured on cluster nodes grows, the idle memory in the cluster grows with it, from 12-14 MB on 32 MB machines to 180-192 MB on 256 MB machines; on machines with more than 64 MB of memory, half of the memory stays idle for more than 12 minutes and a quarter stays idle for more than 30 minutes; across all nodes of a cluster, idle memory accounts for 60-68% of the total cluster memory; and the idle memory on idle nodes alone accounts for 53% of the total cluster memory (see reference 4: A. Acharya and S. Setia. Availability and utility of idle memory in workstation clusters. ACM SIGMETRICS Performance Evaluation Review, 1999).
On the other hand, it can also be observed in practice that, because of load imbalance among nodes, busy nodes run short of physical memory and are forced to use disk swap space. Although swap space logically raises the amount of available memory to a limit set only by disk capacity, and disk is far cheaper than memory, disk is also far slower than memory. The result in cluster systems is an imbalance in which idle nodes leave large amounts of memory unused while busy nodes perform poorly because of insufficient physical memory.
The imbalanced use of memory space in cluster systems described above can be overcome by sharing the memory space within the cluster. Research on exploiting idle memory in clusters dates back to the 1990s. Michael J. Franklin et al. found in 1992 that, in client-server database systems, when a server serves multiple clients, the cache on one client is also useful to other clients, so global memory management is necessary (see reference 1: M. J. Franklin, M. J. Carey, and M. Livny. Global memory management in client-server DBMS architectures. In Proceedings of the 18th VLDB Conference, 1992). Liviu Iftode et al. proposed the concept of a "memory server" in 1993, dividing cluster nodes into "compute nodes" and "memory servers", the latter providing memory to the former when a page fault occurs (see reference 2: L. Iftode, K. Li, and K. Petersen. Memory servers for multicomputers. In Proceedings of the IEEE Spring COMPCON 93, 1993: 538-547). In 1995 Michael J. Feeley proposed the GMS system on the basis of the work in reference 1, and most later systems of this kind are improvements on it (see reference 3: M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, H. M. Levy, and C. A. Thekkath. Implementing global memory management in a workstation cluster. ACM SIGOPS Operating Systems Review, 1995, pages 201-212). The GMS system has the following characteristics:
Global memory management, with nodes joining and leaving dynamically;
Peer nodes, with no distinction between server and client;
Disk is used to extend memory capacity;
No dedicated management node; a master is elected from among the compute nodes.
Anurag Acharya et al. systematically analyzed the availability of idle memory in clusters in 1999, reaching the conclusion mentioned above that clusters suffer serious memory waste, which provides guidance for making use of idle cluster memory.
Michael R. Hines et al. applied network swap space technology in Anemone in 2003 and proposed a new way of using remote memory (see reference 5: M. R. Hines, M. Lewandowski, and K. Gopalan. Anemone: Adaptive network memory engine. Master's thesis, Florida State University, 2003). Later, Tia Newhall (reference 6: T. Newhall, S. Finney, K. Ganchev, and M. Spiegel. Nswap: a network swapping module for Linux clusters. In Proceedings of the Euro-Par '03 International Conference on Parallel and Distributed Computing, Klagenfurt, Austria, 2003) and Guozhong Sun (reference 7: G. Sun, H. Tang, M. Chen, and J. Fan. A scalable dynamic network memory service system. In Proceedings of High-Performance Computing in Asia-Pacific Region, 2005) implemented network swap space with Nswap and NBD respectively. Shuang Liang implemented network swap space over InfiniBand in 2005 (reference 8: S. Liang, R. Noronha, and D. K. Panda. Swapping to remote memory over InfiniBand: An approach using a high performance network block device. IEEE Cluster Computing, September 2005).
Summing up previous work, the implementation techniques that use remote node memory to improve cluster performance fall mainly into the following categories:
Catching remote paging (page faults) in kernel mode;
Providing a memcpy-style interface;
Block devices and swap space;
Catching SIGSEGV in user mode.
Eric A. Anderson et al. summarized the implementation methods for network memory into six kinds:
Explicitly modifying the application: code that stores data remotely and retrieves it again is written directly into the application. This technique requires major modification of the application code, but it can raise performance the most and minimize overhead, because the scheme controls the whole memory usage structure and can be designed around the application's access pattern;
A new malloc function: this solution requires the user to modify the code to use a new malloc function that allocates memory from the network. It changes existing code less than the previous approach, but code still has to be modified. The user can limit how much local memory an application occupies so that other applications run faster. Compared with the schemes that follow, this one is the most general: it makes the fewest assumptions about the operating system, although handling page fault interrupts through system calls and page-table processing brings extra overhead;
Device driver: this scheme replaces the swap device with a device that sends pages to network memory. Its advantage is that neither the application nor the system kernel needs any modification. Its disadvantage is that the system's pages are stored remotely, which places very high demands on reliability. Its generality is lower than that of the previous scheme but better than modifying the kernel. Its overhead includes the overhead of the VM subsystem and of kernel-mode/user-mode switches; if an extra process is used, the overhead of process context switches must be added;
Modifying the kernel: without modifying the application, this scheme can provide the best performance, because it needs no extra interrupts or context switches. A modified kernel, however, is not portable across different architectures. Because the VM overhead drops and many other overheads are removed, this scheme is rather attractive;
Network interface: this scheme replaces the memory controller with a network memory chip, so it has the poorest generality, but it has been used to handle shared memory in multiprocessor architectures such as Alewife, Flash and Shrimp. Its overhead is lower than modifying the kernel, because transfers can be made in cache-line units instead of whole pages, so little data is transmitted; this scheme also avoids the VM overhead. Multiprocessors introduce the overhead of maintaining memory coherence, but if only one processor uses the data this overhead is negligible.
For details of existing network memory sharing implementations, see reference 9: E. A. Anderson and J. M. Neefe. An exploration of network RAM. Technical Report CSD-98-1000, UC Berkeley, December 1994.
Corresponding to the implementation methods above, the architectures of existing memory sharing systems can be divided into client-server and peer-to-peer. Strictly speaking, the client-server style is not a way of using the idle memory of remote nodes, because the memory server bears no computation task. Most client-server designs are backed by disk storage media and can therefore provide applications with a huge storage space. Peer-to-peer designs can use the idle memory of the nodes effectively, but because there is no backing store the space a peer-to-peer structure can provide is limited.
Although much related work has been done so far, some problems remain unsolved.
Application demands vary: computation-type applications need to compute directly in memory, while storage-type applications tend to use a block device interface to cache metadata. Existing techniques cannot serve both kinds of application.
With the rise of network storage technologies such as iSCSI, the kernel's demand for memory keeps growing. Xubin He et al. proposed adding a cache to the iSCSI architecture to improve iSCSI performance, and Jizhong Han et al. analyzed the I/O characteristics of iSCSI and demonstrated the feasibility of improving iSCSI performance by making full use of the iSCSI target's memory. Xuhui Liu et al. implemented Remote iCache with NBD. Any iCache must be implemented in kernel space, yet existing techniques cannot support kernel-space applications.
In addition, client-server interconnection structures must use dedicated memory server nodes, whose CPU and memory resources are wasted to some extent, while peer-to-peer structures cannot provide TB- or even PB-scale space in the way a memory server backed by disk can.
Summary of the invention
The objective of the present invention is to overcome the defects that existing memory sharing systems cannot support large-scale memory requests and that memory load imbalance wastes resources, and thereby to provide an efficient remote memory sharing system.
Another objective of the present invention is to overcome the defect that existing memory sharing systems cannot serve computation-type and storage-type applications at the same time, and thereby to provide a remote memory sharing system that accommodates both types of application.
A third objective of the present invention is to overcome the defect that existing memory sharing systems cannot serve kernel-mode and user-mode applications at the same time, and thereby to provide a remote memory sharing system that accommodates both types of application.
A further objective of the present invention is to provide a method for implementing remote memory sharing.
To achieve these objectives, the present invention provides a remote memory sharing system comprising a manager and at least two compute nodes, and further comprising a memory server; wherein a memory pool driver module that supplies and manages physical memory blocks for the remote memory sharing system is installed on each compute node, the compute nodes carrying the memory pool driver module are connected to the network through connectors, and the compute nodes and the memory server are all connected to the manager through the network; the memory pool driver modules of all compute nodes combine the physical memory contributed to the remote memory sharing system with the memory provided by the memory server into a memory pool, and the memory in the pool is shared by the applications on all compute nodes; each compute node further comprises a mapper that provides memory mapping and a swapper that provides a block device; and the memory pool driver module also provides an interface for kernel-mode applications;
The memory pool driver module manages the physical memory blocks of the whole remote memory sharing system through an idle list free_list, an LLL list that links the block descriptions of local-node physical memory blocks used by local applications, an RLL list that links the block descriptions of local-node physical memory blocks used by remote applications, an RRL list that links the block descriptions of remote-node physical memory blocks used by remote applications, and an LRL list that links the block descriptions of remote-node physical memory blocks used by local applications;
In the remote memory sharing system, an application deposits a physical memory block into the system and, in exchange, the system gives the application a physical memory block to use; the application has full control over the physical memory blocks allocated to it, but a physical memory block the application has handed to the system can no longer be used directly; when the application needs the data again it can only redeem it, by the number of that physical memory block, in exchange for another physical memory block; the system guarantees that the content of the redeemed block is consistent with what was previously deposited, but does not guarantee that it is the same physical memory block.
In the above technical solution, the mapper is a character device that implements the mmap interface.
In the above technical solution, the swapper is a block device interface.
In the above technical solution, the memory server uses a file system and, through the operating system's page cache, uses the memory of its own node as a cache to provide memory to the compute nodes in the remote memory sharing system.
The present invention also provides a method for implementing remote memory sharing, comprising:
Step 10: an application on a compute node requests a memory block; the block description of a free block is first moved from the local idle list free_list to the LLL list, which links the block descriptions of local-node physical memory blocks used by local applications; if free_list contains no block descriptions of free memory blocks, or not enough of them, the next step is executed;
Step 20: the local free memory blocks cannot satisfy the local application's demand; the memory pool driver module sends a request to the manager asking it to find free memory blocks, the manager returns the lookup result to the requesting node, and a communication connection is established between the requesting node and the node holding the free memory blocks;
Step 30: after the local node and the node holding the free memory blocks have established a connection, the memory pool driver module transfers the data in the physical blocks corresponding to the block descriptions at the tail of the RLL list, which links the block descriptions of local-node physical memory blocks used by remote applications, to the remote node, links those block descriptions onto the RRL list, which links the block descriptions of remote-node physical memory blocks used by remote applications, and gives the vacated physical memory blocks to the local application; if the free memory blocks obtained still cannot satisfy the application's demand, the next step is executed;
Step 40: the block description at the tail of the LLL list is linked onto the LRL list, which links the block descriptions of remote-node physical memory blocks used by local applications, the data in the physical memory block corresponding to that description is sent to a physical memory block on the remote node, and the vacated physical memory block is given to the local application;
Step 50: an application on the local node accesses a memory block, comprising:
Step 51: determining where the block description of the memory block the local application wants to access is located; if the block description is on the LLL list, the next step is executed; if it is on the LRL, step 53 is executed;
Step 52: the block description of the memory block to access is on the LLL; the block description of the accessed memory block is moved to the head of the LLL, the recorded physical block address is replaced with the physical block address passed in, and the previously recorded physical block address is given to the application;
Step 53: the data of the memory block to access is on a remote node; the local node communicates with the remote node, the data is exchanged between the physical memory block corresponding to the block description at the tail of the LLL and the physical memory block on the remote node that holds the data to access, the block description of the memory block whose data was swapped out to the remote node is linked onto the head of the LRL, and after the physical memory block corresponding to the block description from the LLL tail has received the data to access, that block description is placed at the head of the LLL;
Step 60: an application on the remote node accesses a memory block, comprising: the application on the remote node accesses the physical memory block corresponding to a block description kept in the local RRL list; the location on the remote node of the data of the physical memory block corresponding to that block description is returned to the application on the remote node, and the block description is deleted from the RRL list;
Step 70: the local application finishes using memory blocks and releases them; the physical memory blocks whose block descriptions are on the LRL are released preferentially;
Step 80: the physical memory blocks whose block descriptions are on the LLL list are released; after the release of memory blocks is finished, the current LRL list is checked for remaining block descriptions; if block descriptions remain, the physical memory block corresponding to a released block description is used to read the data saved in a memory block on the remote node back to the local node.
The advantages of the present invention are:
1. The remote memory sharing system of the present invention can balance the memory resources of the nodes in the system and overcome the waste of resources caused by unreasonable allocation of computing resources.
2. The remote memory sharing system of the present invention provides memory mapping through the mapper, which serves memory-type applications, and provides a block device through the swapper, which serves storage-type applications, overcoming the defect that the prior art cannot serve memory-type and storage-type applications at the same time.
3. The present invention provides interfaces to user-mode applications through the mapper and the swapper, and provides an interface to kernel-mode applications through kernel interface functions, overcoming the defect that the prior art cannot serve user-mode and kernel-mode applications at the same time.
Description of drawings
Fig. 1 is a structure diagram of the remote memory sharing system of the present invention;
Fig. 2 shows the process by which an application on a compute node requests, uses and releases blocks from the remote memory sharing system;
Fig. 3 is a schematic diagram of the 5 lists of a compute node with 5 vacant free memory blocks;
Fig. 4 is a schematic diagram of the 5 lists after a local application on the compute node occupies a physical memory block;
Fig. 5 is a schematic diagram of the 5 lists after the local application on the compute node reads data;
Fig. 6 is a schematic diagram of the 5 lists after memory blocks of the local node are scheduled for the remote node;
Fig. 7 is a schematic diagram of the 5 lists after the local application of the compute node requests another 6 memory blocks;
Fig. 8 is a schematic diagram of the 5 lists after the local application of the compute node adds the corresponding data to the 6 newly requested memory blocks;
Fig. 9 is a schematic diagram of how the lists change when the local application in the compute node accesses memory block 1;
Fig. 10 is a schematic diagram of the 5 lists after the local application in the compute node accesses memory block 1;
Fig. 11 is a schematic diagram of how the lists change when an application on the remote node accesses memory block 2 of the local node;
Fig. 12 is a schematic diagram of how the lists change when the application on the local node releases memory blocks 1, 9 and 5;
Fig. 13 is a schematic diagram of a compute node requesting memory blocks.
Embodiment
The invention will be further described below in conjunction with the drawings and specific embodiments.
Fig. 1 is the structure diagram of the remote memory sharing system of the present invention; the system is formed by connecting a plurality of compute nodes through a network. In the whole system, each compute node can take on three kinds of roles, namely the memory pool driver module (rmp_pool), the manager (rmp_manager) and the memory server (rmp_memserver).
The memory pool driver module (rmp_pool) is installed on a compute node and contributes part of that node's physical memory to the remote memory sharing system (RMP); the shared physical memory obtained from all compute nodes forms a memory pool. The memory pool driver module exposes an interface group through which the applications of the remote memory sharing system can access the physical memory that each compute node has contributed to RMP. rmp_pool is the core of RMP, and it uses connectors to interconnect with other nodes (compute nodes or the memory server).
On top of the memory pool driver module (rmp_pool) of a compute node there are also two special applications, the mapper and the swapper. The mapper is an application built on rmp_pool; it is a character device that implements the mmap interface, so a user-mode program can map RMP memory into the address space of a user process through the mmap system call. The swapper is likewise an application built on rmp_pool; it is a block device interface that turns RMP memory into a block device, on which a file system can be built to serve storage-type applications, or which can be used as swap space to improve system performance. The present invention thus provides memory mapping through the mapper for memory-type applications and a block device through the swapper for storage-type applications, overcoming the defect that the prior art cannot serve memory-type and storage-type applications at the same time. In addition, rmp_pool provides a programming interface for kernel-mode applications, overcoming the defect that the prior art cannot serve kernel-mode and user-mode applications at the same time.
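As a rough illustration of how a user-mode program could use the mapper described above, the sketch below maps pool memory into a process with mmap. The device path /dev/rmp_mapper and the mapping size are assumptions made for illustration; the text only states that the mapper is a character device implementing the mmap interface.

/* Minimal user-space sketch: map remote-pool memory into a process through
 * the mapper character device. "/dev/rmp_mapper" is a hypothetical device
 * node; the mapping size is arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;              /* 64 MB of pooled memory (assumed) */
    int fd = open("/dev/rmp_mapper", O_RDWR);   /* hypothetical mapper device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* mmap() maps pool blocks into the process address space; whether a
     * block is backed locally or on a remote node is decided by the driver. */
    void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    memset(pool, 0, len);                       /* use the memory like ordinary RAM */

    munmap(pool, len);
    close(fd);
    return 0;
}

A storage-type application would instead build a file system on the block device exposed by the swapper, or enable that device as a swap area.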
The manager is responsible for finding and allocating memory for compute nodes that request remote-node memory. In the remote memory sharing system, every compute node is connected to the node acting as manager and registers with the manager the amount of memory it contributes to RMP. The manager keeps the connection information of all compute nodes and of the memory server. When a compute node requests memory, the manager finds free memory for it by polling, and the requesting node then connects to the other node using the information the manager sends back. The manager's tasks are light and require neither complex computation nor heavy data transmission, so it does not need a dedicated node and can be deployed on a compute node or on the memory server.
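To make the polling concrete, here is a small sketch of how a manager might walk its table of registered nodes to find one that can lend enough free blocks. The struct and function names are illustrative assumptions and do not come from the patent.

/* Sketch of the manager's poll-based lookup over its registered nodes. */
#include <stddef.h>

struct rmp_node {
    int id;                                       /* node identifier */
    int is_memserver;                             /* memory server or compute node */
    int (*query_free_blocks)(struct rmp_node *);  /* how many blocks it can lend */
};

/* Poll the registered nodes in turn and return the first one (other than the
 * requester) that can supply at least `wanted` blocks; the caller then sends
 * that node's connection information back to the requesting compute node. */
struct rmp_node *rmp_manager_find_free(struct rmp_node *nodes, size_t n,
                                       int requester_id, int wanted)
{
    for (size_t i = 0; i < n; i++) {
        struct rmp_node *cand = &nodes[i];
        if (cand->id == requester_id)
            continue;                             /* never lend a node its own memory */
        if (cand->query_free_blocks(cand) >= wanted)
            return cand;
    }
    return NULL;                                  /* no node can satisfy the request */
}

In Fig. 13 below, this lookup roughly corresponds to the manager querying compute node 4 and the memory server before answering compute node 1.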
The memory server uses its own large storage capacity to provide "memory" to the outside according to the RMP protocol. The memory server does not contribute its memory itself to RMP; rather, it offers its storage resources to RMP through the operating system's file system. Although to the other nodes a memory server node looks just like a peer compute node with a large memory resource, it is in fact a node dedicated to this service and bears no computation task itself. Because the memory server uses a file system, it can rely on techniques such as the operating system's page cache, using its local memory as a cache to speed up access. The memory server is usually deployed on a dedicated node rather than on a compute node.
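The following sketch shows one way such a file-backed memory server could store and fetch blocks while letting the operating system's page cache keep hot blocks in RAM, as described above. The backing-file name, the block size and the function names are assumptions for illustration.

/* Sketch of a file-backed memory server: blocks are stored at fixed offsets
 * in one large file, and recently used blocks are served from page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RMP_BLOCK_SIZE (4 * 4096)   /* one block = several contiguous pages (4 assumed) */

/* Store one block's data under the given block number. */
static int memserver_write_block(int fd, long block_no, const char *buf)
{
    off_t off = (off_t)block_no * RMP_BLOCK_SIZE;
    return pwrite(fd, buf, RMP_BLOCK_SIZE, off) == RMP_BLOCK_SIZE ? 0 : -1;
}

/* Fetch one block's data back for a compute node. */
static int memserver_read_block(int fd, long block_no, char *buf)
{
    off_t off = (off_t)block_no * RMP_BLOCK_SIZE;
    return pread(fd, buf, RMP_BLOCK_SIZE, off) == RMP_BLOCK_SIZE ? 0 : -1;
}

int main(void)
{
    /* "rmp_memserver.img" is a hypothetical backing file on the server's disks. */
    int fd = open("rmp_memserver.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char block[RMP_BLOCK_SIZE] = "AAA";     /* toy payload, as in the figures */
    memserver_write_block(fd, 0, block);
    memserver_read_block(fd, 0, block);
    printf("block 0 holds: %.3s\n", block);

    close(fd);
    return 0;
}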
To connect a compute node carrying the memory pool driver module (rmp_pool) to the network, a connector must be loaded on the compute node. The connector is a class used to implement connections; through connectors the RMP network can be deployed on various types of networks.
In the remote memory sharing system of the present invention, the memory management on each compute node is the key to implementing remote memory sharing. Each compute node sets aside part of its idle physical memory, and this memory is registered with the manager as part of the memory pool. The remote memory sharing system provides the applications on every compute node with an interface to the memory pool; under the manager's scheduling, a compute node offers its idle physical memory to other nodes, or requests idle physical memory from other nodes on behalf of local applications.
The memory pool driver module (rmp_pool) divides the memory on a compute node into blocks of equal size and uses the "block" as the smallest unit of memory management. One block consists of one or several contiguous memory pages. Because of this management mode, the lowest-level interface that the remote memory sharing system (RMP) exposes (the interface for kernel applications) is based on "exchange": an application deposits a physical memory block into RMP and, in return, RMP gives the application a physical memory block to use. A block allocated to an application is said to "belong to the application"; the application has full control over it and may read it, write it or even release it arbitrarily, and RMP does not transmit it or perform other operations on it. A physical block that an application has handed to RMP, on the other hand, may be read, written and transmitted by RMP and can no longer be used directly by the application; when the application needs the data it can only redeem it, by the block's number, in exchange for another physical block. RMP guarantees that the content of the redeemed block is consistent with what was deposited, but does not guarantee that it is the same physical block.
Fig. 2 describes how an application on a compute node requests, uses and releases blocks from the remote memory sharing system.
The application first obtains a page from RMP, and RMP returns the block's number to the application. Later the application uses that number to redeem the block, new or previously deposited, from RMP, and the block it deposits in exchange takes the place of that number. After finishing, the application tells RMP to release the block. The application does not care how rmp_pool handles the pages it has deposited.
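To make this exchange-based lifecycle concrete, a sketch of what the kernel-facing interface could look like is given below. Only the semantics, depositing a block, receiving a numbered block in return, redeeming by number and releasing, come from the text; the function names and signatures are assumptions.

/* Sketch of an exchange-style interface; declarations only, names assumed. */
#include <stdint.h>

struct rmp_block;                         /* opaque handle for one physical block */

/* Deposit `given` into the pool; RMP hands back a block the caller fully owns
 * and writes the pool-wide number of the deposited data into *block_no. */
struct rmp_block *rmp_get(struct rmp_block *given, uint64_t *block_no);

/* Redeem block `block_no`: hand `given` to the pool and receive a block whose
 * content equals what was deposited under that number; it need not be the
 * same physical block. */
struct rmp_block *rmp_redeem(uint64_t block_no, struct rmp_block *given);

/* Declare that the data stored under `block_no` is no longer needed. */
void rmp_put(uint64_t block_no);

Under this reading, the block number is the application's only durable handle on its data, which is what allows RMP to migrate the backing physical block freely between nodes.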
When the memory pool driver module (rmp_pool) on a compute node manages physical memory blocks, it uses 5 kinds of lists: the descriptions of the memory blocks (block descriptions for short) are linked onto the 5 lists so that the physical memory blocks can be scheduled and allocated. The 5 lists are free_list, LLL, LRL, RLL and RRL. free_list is the idle list; the block descriptions of all physical memory blocks used neither by local applications nor by remote nodes are linked onto it, and when a block needs to be allocated an idle block description is taken from free_list and its corresponding physical memory block is allocated to the application. The meanings of the remaining 4 lists are described in detail in Table 1.
LLL: links the block descriptions of local-node physical memory blocks used by local applications
RLL: links the block descriptions of local-node physical memory blocks used by remote applications
RRL: links the block descriptions of remote-node physical memory blocks used by remote applications
LRL: links the block descriptions of remote-node physical memory blocks used by local applications
Table 1
Of the lists above, LRL and RLL cannot both be non-empty at the same time, because if both were non-empty, remote applications would be using local physical memory blocks while local applications were using physical memory blocks on remote nodes; this situation would greatly reduce the efficiency of the remote memory sharing system, and RMP avoids it when allocating and releasing memory. The block descriptions on the 5 lists contain the relevant information about the physical memory block, such as its number, its location, and its number on the remote node. Each block description on the LLL and RLL lists corresponds to a physical memory block in the local node.
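A minimal sketch of this bookkeeping is shown below, assuming a doubly linked list in the style of the Linux kernel's list_head. The field names are illustrative; the text only says that a block description records information such as the block's number, its location and, for data shipped out, its number on the remote node.

/* Sketch of the per-node block bookkeeping; field names are assumptions. */
#include <stdint.h>

struct list_head { struct list_head *prev, *next; };

struct rmp_block_desc {
    struct list_head link;      /* links the description into one of the 5 lists */
    uint64_t local_no;          /* block number on this node */
    void    *local_addr;        /* address of the local physical block, if any */
    int      remote_node;       /* node now holding the data, if it was shipped out */
    uint64_t remote_no;         /* block number on that remote node */
    int      dirty;             /* clean blocks need not be transmitted */
};

/* The five lists managed by rmp_pool on every compute node. */
struct rmp_pool_lists {
    struct list_head free_list; /* idle local blocks used by nobody */
    struct list_head lll;       /* local blocks used by local applications */
    struct list_head rll;       /* local blocks lent to remote applications */
    struct list_head rrl;       /* remote-application blocks whose data now sits on a remote node */
    struct list_head lrl;       /* local-application blocks whose data now sits on a remote node */
};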
The method by which the remote memory sharing system of the present invention implements remote memory sharing comprises three major steps, requesting memory blocks, accessing memory blocks and releasing memory blocks, which are described in turn below.
One. Requesting memory blocks.
Step 10: an application on a compute node requests a memory block; the block description of a free block is first moved from the local free_list to the LLL list; if free_list contains no block descriptions of free memory blocks, or not enough of them, the next step is executed.
As shown in Fig. 3, in one embodiment there are 5 free memory blocks on the idle list free_list, labelled 1, 2, 3, 4 and 5. The local physical addresses of the memory blocks are 0x0, 0x1, 0x2, 0x3 and 0x4. The free memory blocks are marked clean, indicating that no meaningful information has yet been stored in them; after an application replaces a block with a new one, the block is called dirty. During remote transfers a clean block need not be transmitted, which saves network bandwidth. For convenience of description, the examples below assume that the application on this node holds only 1 physical memory block, whose initial physical address is 0x5.
In Fig. 4, a local application on the compute node occupies a physical memory block: the application requests one free memory block from RMP, and the block description of free memory block 1 is moved from the idle list free_list to the LLL list.
As mentioned in the description above, when an application accesses a memory block in RMP the access is based on "exchange". Before accessing memory block 1 the application already holds its own physical memory block, whose address is 0x5; following the exchange principle, after the application has accessed memory block 1 its physical address is no longer the original 0x0 but 0x5. As shown in Fig. 5, the physical memory block after the exchange stores the data AAA, which is recorded in the block description.
Step 20: the local free memory blocks cannot satisfy the local application's demand; the memory pool driver module (rmp_pool) sends a request to the manager (rmp_manager) asking it to find free memory blocks, the manager returns the lookup result to the requesting node, and a communication connection is established between the requesting node and the node holding the free memory blocks.
This step also involves the local node allocating memory blocks for a remote compute node. When the local node receives a request from the manager to allocate memory blocks for a remote node, it moves the corresponding number of free memory blocks from the idle list onto the RLL list and returns the number of blocks actually provided to the manager, which notifies the corresponding remote node.
Fig. 13 describes the process of compute node 1 requesting memory. Before the request, compute node 2 had already used memory of compute node 1 and of compute node 4 through the swapper, and compute node 1 was serving an application through the mapper. At the moment shown, the application on compute node 1 requests memory from RMP through the mapper. Compute node 1 requests memory from the manager; the manager queries compute node 4 and the memory server (memserver) for their amount of free memory and returns the information about these two nodes to compute node 1. Based on the returned result, compute node 1 connects to the memory server and to compute node 4, and part of the memory used by compute node 2 is transferred to compute node 4.
As shown in Fig. 6, the local node receives the notification sent by the manager and reserves the free memory blocks labelled 2, 3 and 4 for applications on the remote node; of these, memory blocks 2 and 3 have already been used by the application on the remote node. The block descriptions of the memory blocks requested by the remote node are now kept on the RLL list.
Step 30: after the local node and the node holding the free memory blocks have established a connection, the memory pool driver module (rmp_pool) transfers the data in the physical blocks corresponding to the block descriptions toward the rear of the RLL list (the less frequently accessed blocks) to the remote node, links those block descriptions onto the RRL list, and gives the vacated physical memory blocks to the local application; if the free memory blocks obtained still cannot satisfy the application's demand, the next step is executed.
Step 40: the block description at the tail of the LLL list is linked onto the LRL list, the data in the physical memory block corresponding to that description is sent to a physical memory block on the remote node, and the vacated physical memory block is given to the local application.
As shown in Fig. 7, in this embodiment the local application on the compute node requests another 6 memory blocks. As described above, one free memory block, block 5, is still kept on the idle list; it is given to the local application and its block description is added to the LLL list. The local application still needs 5 more memory blocks, which cannot be satisfied on the local node, so a memory request is sent to the manager; the manager searches the other nodes of the remote memory sharing system, finds vacant memory blocks on the compute node labelled N2, and the local node (denoted N1) establishes a communication connection with N2. The data in memory blocks 2, 3 and 4 on the RLL list is then transferred into free memory blocks on node N2, and block descriptions for the memory blocks transferred to the remote node are added to the RRL list; besides the local number of the physical memory block, a block description on the RRL list also records the number of the node to which the data was transferred and the number of the memory block holding the data on the remote node. On the local node, the idle physical memory blocks obtained after the data transfer are given to the local application. In Fig. 7 it can be seen that the blocks labelled 6, 7 and 8 occupy the memory that blocks 4, 3 and 2 originally occupied (their physical addresses are identical). After this operation, the order of the memory blocks on the LLL list from head to tail is 8, 7, 6, 5, 1. Because the local application requested 6 memory blocks in total and the operations above satisfied 4 of them, 2 more memory blocks are still needed. At this point the data of the physical memory block corresponding to the block description at the tail of the LLL list is transferred to the remote node and a corresponding block description is added to the LRL list; once its data has been sent to the remote node, the physical memory block can be used by the application on the local node. The description at the tail of the LLL is that of memory block 1, whose corresponding physical block address is 0x5 and whose recorded data is AAA; its data is transferred to the remote node, a block description for memory block 1 is added to the LRL, and the idle physical memory block thus obtained is given to the local application and receives the new label 9, its address unchanged at 0x5. The same operation is then done for memory block 5, yielding memory block 10.
The description of Fig. 7 completes the application's request for physical memory blocks; in Fig. 8 the application has added the corresponding data to each of the requested memory blocks.
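Putting steps 10 to 40 together, the allocation path on a compute node can be sketched as follows. The pool and descriptor types follow the sketch after Table 1, and the helper routines (the list queries, the manager request and the data transfer) are only declared as placeholders here; they are assumptions, not functions named in the text.

/* Sketch of the allocation path (steps 10-40); only the control flow mirrors
 * the text, all types and helpers are illustrative placeholders. */
#include <stddef.h>

struct rmp_pool;                    /* per-node state holding the five lists */
struct rmp_block_desc;              /* one block description */
struct rmp_remote;                  /* a connection to a remote node */

enum rmp_list { RMP_FREE, RMP_LLL, RMP_RLL, RMP_RRL, RMP_LRL };

int  rmp_list_empty(struct rmp_pool *p, enum rmp_list which);
struct rmp_block_desc *rmp_move_free_to_lll(struct rmp_pool *p);      /* step 10 */
struct rmp_remote *rmp_ask_manager(struct rmp_pool *p);               /* step 20: find a node and connect */
struct rmp_block_desc *rmp_list_tail(struct rmp_pool *p, enum rmp_list which);
void rmp_ship_data(struct rmp_block_desc *d, struct rmp_remote *rn);  /* send the block's data to the remote node */
void rmp_move_desc(struct rmp_pool *p, struct rmp_block_desc *d, enum rmp_list to);
struct rmp_block_desc *rmp_reuse_block(struct rmp_pool *p, struct rmp_block_desc *vacated); /* new LLL entry on the freed physical block */

struct rmp_block_desc *rmp_alloc_block(struct rmp_pool *p)
{
    /* Step 10: hand out an idle local block if one exists. */
    if (!rmp_list_empty(p, RMP_FREE))
        return rmp_move_free_to_lll(p);

    /* Step 20: no free local block - ask the manager for a node with spare
     * memory and establish a connection to it. */
    struct rmp_remote *rn = rmp_ask_manager(p);
    if (rn == NULL)
        return NULL;

    /* Step 30: push out a block at the tail of RLL (a local block lent to a
     * remote application), move its description to RRL, and reuse the
     * vacated physical block for the local application. */
    if (!rmp_list_empty(p, RMP_RLL)) {
        struct rmp_block_desc *d = rmp_list_tail(p, RMP_RLL);
        rmp_ship_data(d, rn);
        rmp_move_desc(p, d, RMP_RRL);
        return rmp_reuse_block(p, d);
    }

    /* Step 40: otherwise push out the tail of LLL (a local application's
     * block), move its description to LRL, and reuse the vacated block. */
    struct rmp_block_desc *d = rmp_list_tail(p, RMP_LLL);
    rmp_ship_data(d, rn);
    rmp_move_desc(p, d, RMP_LRL);
    return rmp_reuse_block(p, d);
}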
Two. Accessing memory blocks.
Accesses to memory blocks can be divided into accesses by an application on the local node and accesses by an application on a remote node, and the implementation of either kind of access differs again according to where the memory block to be accessed is located. The accesses to memory blocks by local applications and by remote applications are described separately below.
A local application accesses the information in a memory block by way of "exchange", and the access is implemented differently depending on the memory block's position in the lists.
Step 50: a local application accesses a memory block.
Step 51: determine where the block description of the memory block the local application wants to access is located; if the block description is on the LLL list, execute the next step; if it is on the LRL list, execute step 53.
Step 52: the block description of the memory block to access is on the LLL list; move the block description of the accessed memory block to the head of the LLL, replace the recorded physical block address with the physical block address passed in, and give the previously recorded physical block address to the application.
Step 53: the data of the memory block to access is on a remote node (that is, the block description of this memory block is on the LRL list); the local node communicates with the remote node and exchanges the data between the physical memory block corresponding to the block description at the tail of the LLL list and the physical memory block on the remote node that holds the data to be accessed. The block description of the memory block whose data was swapped out to the remote node is linked onto the head of the LRL, and after the physical memory block corresponding to the block description from the LLL tail has received the data to be accessed, that block description is placed at the head of the LLL.
As shown in Figs. 9 and 10, the local application on local node N1 accesses memory block 1. From the embodiment described in step 40, the block description of memory block 1 is on the LRL list, i.e. the data of this memory block is kept on remote node N2. Therefore the block description of memory block 6 is taken from the tail of the LLL list, and the data GGG stored in its corresponding physical memory block is sent to remote node N2; remote node N2 in turn sends the data AAA originally kept by memory block 1 back to local node N1, where it is stored in the physical memory block in which block 6 resided; that physical memory block is now marked as corresponding to block description 1, and the description is placed at the head of the LLL. At the same time, the block description of memory block 6 is linked onto the head of the LRL. When the application accesses memory block 1 an exchange operation is performed; the address of the physical memory block after the exchange is 0x0, and the data it holds is ZZZ.
Step 60: an application on a remote node accesses the physical memory block corresponding to a block description kept in the local RRL list. A block description on the RRL list indicates that the data of its corresponding physical memory block has been sent to a remote node; the location on the remote node of the data of that physical memory block is returned to the application on the remote node, and the block description is deleted from the RRL list.
As shown in Fig. 11, the application on the remote node wants to access memory block 2 of the local node. The data in this memory block was already sent to remote node N2 in an earlier operation, but a corresponding block description remains on the RRL list; it shows that the information in memory block 2 was sent to remote node N2 and is kept in memory block 3 of that node. After this information is sent over the network to the application that wants to access memory block 2, the application performs its further access operations on remote node N2, and the block description of memory block 2 is deleted from the RRL list.
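A compact sketch of the local access path (steps 51 to 53) is given below, again with placeholder types and helpers; only the list movements and the data swap with the remote node follow the text, and the returned address models the "exchange" of physical blocks with the application.

/* Sketch of local access (steps 51-53); declarations are placeholders. */
struct rmp_pool;
struct rmp_block_desc;

enum rmp_list { RMP_LLL, RMP_LRL };

int  rmp_desc_on_lll(struct rmp_pool *p, struct rmp_block_desc *d);
void rmp_move_to_head(struct rmp_pool *p, struct rmp_block_desc *d, enum rmp_list which);
struct rmp_block_desc *rmp_lll_tail(struct rmp_pool *p);
void rmp_swap_with_remote(struct rmp_block_desc *victim, struct rmp_block_desc *wanted); /* swap data over the network and update both descriptions */
void *rmp_swap_addr(struct rmp_block_desc *d, void *given_addr); /* record given_addr, return the old address */

/* `given_addr` is the physical block the application hands in; the address
 * previously recorded in the description is handed back to the application. */
void *rmp_access_block(struct rmp_pool *p, struct rmp_block_desc *d, void *given_addr)
{
    if (rmp_desc_on_lll(p, d)) {
        /* Step 52: the description is on LLL - move it to the head and
         * exchange the recorded physical address with the one passed in. */
        rmp_move_to_head(p, d, RMP_LLL);
        return rmp_swap_addr(d, given_addr);
    }

    /* Step 53: the data is on a remote node (description on LRL). Swap the
     * data of the LLL tail block with the remote copy of the wanted block,
     * link the evicted block's description onto the head of LRL, and place
     * the wanted block's description at the head of LLL. */
    struct rmp_block_desc *victim = rmp_lll_tail(p);
    rmp_swap_with_remote(victim, d);
    rmp_move_to_head(p, victim, RMP_LRL);
    rmp_move_to_head(p, d, RMP_LLL);
    return rmp_swap_addr(d, given_addr);
}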
Three. Releasing memory blocks.
After an application has finished its accesses to memory blocks, it releases the memory blocks it will no longer access.
Step 70: when releasing memory blocks, the physical memory blocks whose block descriptions are on the LRL list are released preferentially.
Step 80: release the physical memory blocks whose block descriptions are on the LLL list; after the release of memory blocks is finished, check whether block descriptions remain on the current LRL list; if block descriptions remain, use the physical memory blocks corresponding to the released block descriptions to read the data saved in memory blocks on the remote node (i.e. memory blocks whose descriptions are on the LRL list) back to the local node.
As shown in Fig. 12, the application on local node N1 releases memory blocks 1, 9 and 5. Memory block 5 is on remote node N2; the local node communicates with remote node N2 and notifies it to release memory block 5, and the remote node places the block description freed by the release on its own idle list. The block description of memory block 1 is on the LLL list of the local node; when this memory block is released, its block description is deleted from the list directly, marked clean, and placed on the idle list. The same release operation is done for memory block 9 on the LLL list. When the release of memory blocks 1 and 9 is finished, the current LRL list is checked for remaining block descriptions; the check shows that the data of memory block 6 is still kept on remote node N2, so the physical memory space freed by memory block 1 or memory block 9 is used to read the data of memory block 6 back from remote node N2, and the block description of memory block 6 is placed on the LLL list.
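Finally, the release path (steps 70 and 80) can be sketched under the same placeholder assumptions: blocks whose data is on a remote node are released first, local blocks are returned to the idle list, and freed local blocks are then used to pull any still-remote data of the application back to the local node.

/* Sketch of the release path (steps 70-80); declarations are placeholders. */
#include <stddef.h>

struct rmp_pool;
struct rmp_block_desc;

int  rmp_desc_on_lrl(struct rmp_pool *p, struct rmp_block_desc *d);
void rmp_release_remote(struct rmp_block_desc *d);        /* tell the remote node to free the block */
void rmp_release_local(struct rmp_pool *p, struct rmp_block_desc *d); /* mark clean, move to free_list */
struct rmp_block_desc *rmp_lrl_head(struct rmp_pool *p);  /* NULL when LRL is empty */
void rmp_fetch_back(struct rmp_pool *p, struct rmp_block_desc *d);    /* read data back into a freed block, move description to LLL */

void rmp_release_blocks(struct rmp_pool *p, struct rmp_block_desc **descs, int n)
{
    /* Step 70: release the blocks whose data lives on a remote node first;
     * the remote node puts the freed block back on its own idle list. */
    for (int i = 0; i < n; i++) {
        if (rmp_desc_on_lrl(p, descs[i])) {
            rmp_release_remote(descs[i]);
            descs[i] = NULL;                              /* handled */
        }
    }

    /* Step 80: release the remaining blocks, whose descriptions are on LLL
     * (they are marked clean and returned to free_list)... */
    for (int i = 0; i < n; i++)
        if (descs[i] != NULL)
            rmp_release_local(p, descs[i]);

    /* ...then, while block descriptions remain on LRL, use the blocks just
     * freed to read the remote-resident data back to the local node (a full
     * implementation would stop once no freed local block is left). */
    struct rmp_block_desc *d;
    while ((d = rmp_lrl_head(p)) != NULL)
        rmp_fetch_back(p, d);
}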

Claims (5)

1. A remote memory sharing system comprising a manager and at least two compute nodes, characterized in that it further comprises a memory server; wherein a memory pool driver module that supplies and manages physical memory blocks for the remote memory sharing system is installed on each compute node, the compute nodes carrying the memory pool driver module are connected to the network through connectors, and the compute nodes and the memory server are all connected to the manager through the network; the memory pool driver modules of all compute nodes combine the physical memory contributed to the remote memory sharing system with the memory provided by the memory server into a memory pool, and the memory in the pool is shared by the applications on all compute nodes; each compute node further comprises a mapper that provides memory mapping and a swapper that provides a block device; and the memory pool driver module also provides an interface for kernel-mode applications;
the memory pool driver module manages the physical memory blocks of the whole remote memory sharing system through an idle list free_list, an LLL list that links the block descriptions of local-node physical memory blocks used by local applications, an RLL list that links the block descriptions of local-node physical memory blocks used by remote applications, an RRL list that links the block descriptions of remote-node physical memory blocks used by remote applications, and an LRL list that links the block descriptions of remote-node physical memory blocks used by local applications;
in the remote memory sharing system, an application deposits a physical memory block into the system and, in exchange, the system gives the application a physical memory block to use; the application has full control over the physical memory blocks allocated to it, but a physical memory block the application has handed to the system can no longer be used directly; when the application needs the data again it can only redeem it, by the number of that physical memory block, in exchange for another physical memory block; the system guarantees that the content of the redeemed block is consistent with what was previously deposited, but does not guarantee that it is the same physical memory block.
2. The remote memory sharing system according to claim 1, characterized in that the mapper is a character device that implements the mmap interface.
3. The remote memory sharing system according to claim 1, characterized in that the swapper is a block device.
4. The remote memory sharing system according to claim 1, characterized in that the memory server uses a file system and, through the operating system's page cache, uses the memory of its own node as a cache to provide memory to the compute nodes in the remote memory sharing system.
5. A method for implementing remote memory sharing, applied to the remote memory sharing system of claim 1, comprising:
Step 10: an application on a compute node requests a memory block; the block description of a free block is first moved from the local idle list free_list to the LLL list, which links the block descriptions of local-node physical memory blocks used by local applications; if free_list contains no block descriptions of free memory blocks, or not enough of them, the next step is executed;
Step 20: the local free memory blocks cannot satisfy the local application's demand; the memory pool driver module sends a request to the manager asking it to find free memory blocks, the manager returns the lookup result to the requesting node, and a communication connection is established between the requesting node and the node holding the free memory blocks;
Step 30: after the local node and the node holding the free memory blocks have established a connection, the memory pool driver module transfers the data in the physical blocks corresponding to the block descriptions at the tail of the RLL list, which links the block descriptions of local-node physical memory blocks used by remote applications, to the remote node, links those block descriptions onto the RRL list, which links the block descriptions of remote-node physical memory blocks used by remote applications, and gives the vacated physical memory blocks to the local application; if the free memory blocks obtained still cannot satisfy the application's demand, the next step is executed;
Step 40: the block description at the tail of the LLL list is linked onto the LRL list, which links the block descriptions of remote-node physical memory blocks used by local applications, the data in the physical memory block corresponding to that description is sent to a physical memory block on the remote node, and the vacated physical memory block is given to the local application;
Step 50: an application on the local node accesses a memory block, comprising:
Step 51: determining where the block description of the memory block the local application wants to access is located; if the block description is on the LLL list, the next step is executed; if it is on the LRL, step 53 is executed;
Step 52: the block description of the memory block to access is on the LLL; the block description of the accessed memory block is moved to the head of the LLL, the recorded physical block address is replaced with the physical block address passed in, and the previously recorded physical block address is given to the application;
Step 53: the data of the memory block to access is on a remote node; the local node communicates with the remote node, the data is exchanged between the physical memory block corresponding to the block description at the tail of the LLL and the physical memory block on the remote node that holds the data to access, the block description of the memory block whose data was swapped out to the remote node is linked onto the head of the LRL, and after the physical memory block corresponding to the block description from the LLL tail has received the data to access, that block description is placed at the head of the LLL;
Step 60: an application on the remote node accesses a memory block, comprising: the application on the remote node accesses the physical memory block corresponding to a block description kept in the local RRL list; the location on the remote node of the data of the physical memory block corresponding to that block description is returned to the application on the remote node, and the block description is deleted from the RRL list;
Step 70: the local application finishes using memory blocks and releases them; the physical memory blocks whose block descriptions are on the LRL are released preferentially;
Step 80: the physical memory blocks whose block descriptions are on the LLL list are released; after the release of memory blocks is finished, the current LRL list is checked for remaining block descriptions; if block descriptions remain, the physical memory block corresponding to a released block description is used to read the data saved in a memory block on the remote node back to the local node.
CNB2006101648500A 2006-12-06 2006-12-06 A remote internal memory sharing system and its realization method Expired - Fee Related CN100486178C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101648500A CN100486178C (en) 2006-12-06 2006-12-06 A remote internal memory sharing system and its realization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101648500A CN100486178C (en) 2006-12-06 2006-12-06 A remote internal memory sharing system and its realization method

Publications (2)

Publication Number Publication Date
CN1972215A CN1972215A (en) 2007-05-30
CN100486178C (en) 2009-05-06

Family

ID=38112821

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101648500A Expired - Fee Related CN100486178C (en) 2006-12-06 2006-12-06 A remote internal memory sharing system and its realization method

Country Status (1)

Country Link
CN (1) CN100486178C (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635669B (en) * 2008-07-25 2011-11-09 中国科学院声学研究所 Method for acquiring data fragments in data-sharing systems
CN101478549B (en) * 2009-01-20 2011-10-05 电子科技大学 Operation method for memory sharing media server and functional module construction
CN101582031B (en) * 2009-06-16 2013-01-16 中兴通讯股份有限公司 Linked list management method based on structured language
CN101594309B (en) * 2009-06-30 2011-06-08 华为技术有限公司 Method and device for managing memory resources in cluster system, and network system
CN103838677B (en) * 2012-11-23 2018-07-06 腾讯科技(深圳)有限公司 The memory block method for releasing and equipment of a kind of shared drive
CN104426971B (en) * 2013-08-30 2017-11-17 华为技术有限公司 A kind of long-distance inner exchange partition method, apparatus and system
US20160179668A1 (en) * 2014-05-28 2016-06-23 Mediatek Inc. Computing system with reduced data exchange overhead and related data exchange method thereof
CN104216835B (en) * 2014-08-25 2017-04-05 杨立群 A kind of method and device for realizing internal memory fusion
CN104765572B (en) * 2015-03-25 2017-12-19 华中科技大学 The virtual storage server system and its dispatching method of a kind of energy-conservation
CN106155910B (en) * 2015-03-27 2021-02-12 华为技术有限公司 Method, device and system for realizing memory access
CN106155923B (en) * 2015-04-08 2019-04-12 华为技术有限公司 The method and apparatus of memory sharing
CN104793986B (en) * 2015-05-05 2018-07-31 苏州中晟宏芯信息科技有限公司 The virtual machine migration method of shared drive between a kind of node
CN105094997B (en) * 2015-09-10 2018-05-04 重庆邮电大学 Physical memory sharing method and system between a kind of cloud computing host node
CN111404986B (en) * 2019-12-11 2023-07-21 杭州海康威视系统技术有限公司 Data transmission processing method, device and storage medium
CN111212141A (en) * 2020-01-02 2020-05-29 中国科学院计算技术研究所 Shared storage system
CN112612429A (en) * 2021-01-06 2021-04-06 武汉飞骥永泰科技有限公司 iscsi target tgt architecture optimization method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5592625A (en) * 1992-03-27 1997-01-07 Panasonic Technologies, Inc. Apparatus for providing shared virtual memory among interconnected computer nodes with minimal processor involvement
US6205528B1 (en) * 1997-08-29 2001-03-20 International Business Machines Corporation User specifiable allocation of memory for processes in a multiprocessor computer having a non-uniform memory architecture
CN1347034A (en) * 2000-09-25 2002-05-01 汤姆森许可贸易公司 Data internal storage managing system and method, and related multiprocessor network
CN1391176A (en) * 2002-04-09 2003-01-15 威盛电子股份有限公司 Maintain method for remote node to read local memory and its application device
CN2569238Y (en) * 2002-07-01 2003-08-27 威盛电子股份有限公司 Reading local internal memory maintenance device by remote distance node in distributive shared internal memory system
CN1447257A (en) * 2002-04-09 2003-10-08 威盛电子股份有限公司 Data maintenance method for distributed type shared memory system
CN1553716A (en) * 2003-06-04 2004-12-08 中兴通讯股份有限公司 Clustering system for utilizing sharing internal memory in mobile communiation system and realizing method thereof

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5592625A (en) * 1992-03-27 1997-01-07 Panasonic Technologies, Inc. Apparatus for providing shared virtual memory among interconnected computer nodes with minimal processor involvement
US6205528B1 (en) * 1997-08-29 2001-03-20 International Business Machines Corporation User specifiable allocation of memory for processes in a multiprocessor computer having a non-uniform memory architecture
CN1347034A (en) * 2000-09-25 2002-05-01 汤姆森许可贸易公司 Data internal storage managing system and method, and related multiprocessor network
CN1391176A (en) * 2002-04-09 2003-01-15 威盛电子股份有限公司 Maintain method for remote node to read local memory and its application device
CN1447255A (en) * 2002-04-09 2003-10-08 威盛电子股份有限公司 Distributed type system of shared momory possessing two nodes and data maintenance method
CN1447257A (en) * 2002-04-09 2003-10-08 威盛电子股份有限公司 Data maintenance method for distributed type shared memory system
CN2569238Y (en) * 2002-07-01 2003-08-27 威盛电子股份有限公司 Reading local internal memory maintenance device by remote distance node in distributive shared internal memory system
CN1553716A (en) * 2003-06-04 2004-12-08 中兴通讯股份有限公司 Clustering system for utilizing sharing internal memory in mobile communiation system and realizing method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Memory-mapped files and their application to fast access of large-volume data files. Yang Ningxue, Zhu Changqian, Nie Aili. Application Research of Computers, No. 8, 2004 *
Dynamic shared-memory buffer pool technology. Yu Xiangzhan, Yin Lihua. Journal of Harbin Institute of Technology, Vol. 36, No. 3, 2004 *
Research and application of data-sharing technology based on memory-mapped files. Sun Wenqing, Liu Bingquan, Xiao Jinghui. Microcomputer Applications, Vol. 26, No. 2, 2005 *

Also Published As

Publication number Publication date
CN1972215A (en) 2007-05-30

Similar Documents

Publication Publication Date Title
CN100486178C (en) A remote internal memory sharing system and its realization method
CN102546782B (en) Distribution system and data operation method thereof
CN102629941B (en) Caching method of a virtual machine mirror image in cloud computing system
CN103345451B (en) Data buffering method in multi-core processor
CN105549905A (en) Method for multiple virtual machines to access distributed object storage system
CN104580437A (en) Cloud storage client and high-efficiency data access method thereof
CN105518631B (en) EMS memory management process, device and system and network-on-chip
EP2288997A2 (en) Distributed cache arrangement
US9390010B2 (en) Cache management
CN104270412A (en) Three-level caching method based on Hadoop distributed file system
CN108519856B (en) Data block copy placement method based on heterogeneous Hadoop cluster environment
CN102136993A (en) Data transfer method, device and system
CN102129434A (en) Method and system for reading and writing separation database
CN103095788A (en) Cloud resource scheduling policy based on network topology
CN102262512A (en) System, device and method for realizing disk array cache partition management
CN111737168A (en) Cache system, cache processing method, device, equipment and medium
CN104580422A (en) Cluster rendering node data access method based on shared cache
CN104715044A (en) Distributed system and data manipulation method thereof
JP2003099384A (en) Load-sharing system, host computer for the load-sharing system, and load-sharing program
CN102123318B (en) IO acceleration method of IPTV application
CN102063407B (en) Network sacrifice Cache for multi-core processor and data request method based on Cache
CN110447019B (en) Memory allocation manager and method for managing memory allocation performed thereby
CN116149814A (en) KAFKA-based data persistence task distributed scheduling method and system
US20220004330A1 (en) Memory pool data placement technologies
CN112579528B (en) Method for efficiently accessing files at server side of embedded network file system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090506

Termination date: 20201206

CF01 Termination of patent right due to non-payment of annual fee