WO2000057309A2

WO2000057309A2 - Database and interface for 3-dimensional molecular structure visualization and analysis

Info

Publication number: WO2000057309A2
Application number: PCT/US2000/007474
Authority: WO
Inventors: Kalyanaraman Ramnarayan; Behnam Vessal; Jeyapandian Kottalam; Cindy L. Fisher; Saied Moezzi; Muthuchidambaram Prabhakaran
Original assignee: Structural Bioinformatics, Inc.
Priority date: 1999-03-19
Filing date: 2000-03-20
Publication date: 2000-09-28
Also published as: EP1163610A2; AU3906000A; JP2002540508A; WO2000057309A3

Abstract

A molecular structure database system collects multiple data files relating to the same molecule in the same subdirectory, and provides an interface to access all of the collected files from the same molecule using a graphical user interface (GUI) program. The collected files can comprise a variety of information and computer file formats, depending on the type of information to be conveyed to users of the database. A user communicates over a shared network with a secure file server that controls access to the collected files, and the interface to the collected files is provided by a GUI program or client applets. This provides a convenient means of searching molecular structure data for characteristics of interest. Data searching, file viewing, and investigation of multiple representations of molecular structures can be carried out from within a single viewing program.

Description

DATABASE AND INTERFACE FOR 3-DIMENSIONAL MOLECULAR STRUCTURE VISUALIZATION AND ANALYSIS

CROSS-REFERENCE TO RELATED APPLICATION

For purposes of International patent applications, this application claims benefit

of priority from U.S. Non-provisional patent application Serial No. 09/272,814 entitled

"Database and Interface for 3-Dimensional Molecular Structure Visualization and

Analysis" filed March 19, 1999. For purposes of any U.S. application, priority is claimed

from U.S. Non-provisional patent application Serial No. 09/272,814 entitled "Database

and Interface for 3-Dimensional Molecular Structure Visualization and Analysis" filed

March 19, 1999. Where permitted, this application incorporates by reference the

referenced pending U.S. patent application in its entirety for all purposes.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to database access and, more particularly, to

interfaces that control access to collections of data.

2. Structure-Based Drug Design

Recent advances in molecular biology, such as the discovery and identification

of large numbers of novel gene sequences encoded in the genomes of humans, other

mammals and infectious disease agents, have contributed to the identification of a large

number of target proteins and other biological macromolecules. Within the decade, up to 150,000 fully sequenced human genes and an estimated 400,000 genes from other

species will be available.

Based on information derived from these sequences, as well as from experimental

methods, such as X-ray crystallography or NMR, or protein structure determination

methodologies, such as homology modeling or ab initio structure prediction,

three-dimensional (3-D) molecular structures of enzymes, ligands or target receptors, such as

protein or other macromolecular receptors, are being determined at increasing rates. 3-D

molecular structure is known to be related to biological function. By employing

structure-based drug discovery methodologies, including structure-directed combinatorial

or molecular diversity screening and computational screening using molecular similarity

or computational docking algorithms, it is possible to derive knowledge of 3-D structure

of biomolecules, which is useful, for example, in facilitating the identification of

pharmacophores and the design of biologically-active compounds, such as small

molecule agonists or antagonists.

As the number of available biomolecular structures increases, there is an

increasing need for capabilities, such as databases, for organizing and providing access

to the structures to make them available for use in structure-based discovery and design.

3. 3-Dimensional Molecular Structures

3-D molecular structures of proteins or other molecules are represented as sets of

atomic coordinates that describe the spatial arrangement and intramolecular connectivity

of the atoms in the molecule. For example, a standard format of representing

macromolecular structural data is specified by the Protein Data Bank (PDB) format. The PDB format organizes molecular structure data by representing atoms according to their

atom types and atomic coordinates. In addition to atomic coordinate data, a PDB format

data file can include information on structural attributes or reactivity of the molecule,

such as active sites or secondary structure attributes. The molecular structure data in a

PDB data file is provided as a simple alphanumeric representation of the atomic

coordinates for a single protein or other macromolecule. Thus, it is stored as a text file

using standard ASCII display codes.
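By way of illustration only, the fixed-column text layout of a PDB-format file can be read with a short routine. The following Python code is a minimal sketch and is not part of the described system. The column positions follow the published PDB layout for ATOM and HETATM records; the names `Atom`, `parse_pdb_atoms`, and `format_atom_line`, and the sample coordinates, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def format_atom_line(serial, name, res_name, chain_id, res_seq, x, y, z):
    """Build one fixed-column ATOM line (hypothetical helper for sample data)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_pdb_atoms(text: str) -> List[Atom]:
    """Extract ATOM/HETATM records from PDB-format text by column position."""
    atoms = []
    for line in text.splitlines():
        if line[:6].strip() not in ("ATOM", "HETATM"):
            continue  # skip header, remark, and terminator records
        atoms.append(Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain_id=line[21],             # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        ))
    return atoms


# Made-up sample records, used only to exercise the parser.
SAMPLE = "\n".join([
    format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    "END",
])

atoms = parse_pdb_atoms(SAMPLE)
```

Because each field occupies a fixed column range, the routine slices each line by position instead of splitting on whitespace, which would fail where adjacent fields touch.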

The PDB is a depository of molecular structure data for biological

macromolecules, that is, researchers are invited to add molecular structure data in the

PDB format to the collection currently maintained at the Brookhaven National

Laboratory (Bernstein et al., "The protein data bank: a computer-based archival file for

macromolecule structures", J. Mol. Biol., 112, 535-542 (1977)). To facilitate maximum

accessibility to the data contained in the PDB, the data files are made available over the

Internet and can be retrieved and viewed by a user. Unfortunately, the PDB collection

of data files is simply an archive comprising a series of molecular data files stored

serially, as the files are deposited by researchers. That is, the PDB data files are flat files.

Thus, there is no logical structure or interrelationship among the data files or between the

records of different files. If a user is interested in viewing the atomic structure of a

particular protein, for example, then the user must conduct a text-based search of the

alphanumeric data in each file, one after the other, until the desired protein name or

structure, or other associated information, is located. Text searching through the

molecular data files at a depository such as the PDB might not effectively identify the desired molecular structural information. Thus, it would be advantageous if molecular

structure data files could be conformed to a logical database design, to permit more

convenient and efficient searching for data records.
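The contrast between a serial text search and a search over a logical database design can be sketched as follows. This Python example uses the standard `sqlite3` module purely for illustration. The table layout, the function names `build_demo_database` and `find_proteins`, and all row values are assumptions made for the example; the field names are loosely based on the query fields named later in this description.

```python
import sqlite3
from typing import List, Optional


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory table with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE protein (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            species TEXT,
            family TEXT,
            disease_state TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO protein (name, species, family, disease_state) "
        "VALUES (?, ?, ?, ?)",
        [
            ("placeholder protein 1", "species X", "family A", "disease P"),
            ("placeholder protein 2", "species Y", "family A", "disease Q"),
            ("placeholder protein 3", "species X", "family B", "disease P"),
        ],
    )
    return conn


def find_proteins(conn: sqlite3.Connection,
                  family: Optional[str] = None,
                  species: Optional[str] = None) -> List[str]:
    """Return protein names matching every field that was supplied."""
    clauses, params = [], []
    if family is not None:
        clauses.append("family = ?")
        params.append(family)
    if species is not None:
        clauses.append("species = ?")
        params.append(species)
    sql = "SELECT name FROM protein"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


conn = build_demo_database()
family_a = find_proteins(conn, family="family A")
```

A query of this kind selects records by field value in a single step, whereas a flat-file archive requires each file to be opened and scanned in turn.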

The coordinates of molecular structures can be read or downloaded into a

molecular graphics program for visualization and manipulation and structural analysis.

For example, a researcher can textually search the PDB data file to locate a protein of

interest, and then can import the located file into a viewer program to obtain a visual

representation of the protein. Several such graphics packages are known and are readily

available commercially. Molecular graphics programs read in atomic coordinate data,

such as is contained in a PDB-format file, and perform calculations to construct a three-

dimensional representation of the molecule. Additionally, data beyond simple molecular

structure, such as molecular shape analysis, energetic or strain analysis, active or reactive

sites, variants or properties such as electrostatic potential or hydrophobicity, among

others, may be associated with a given molecular structure. It would also be desirable

to provide access to this information, as well. Currently, many of these data files are

generated and must be viewed using different programs. To date, there is no software

package available that integrates a molecular graphics program with a relational database

to permit navigation among related molecular structures stored in a database and

visualization analysis of molecular structures and their related properties within a single

user interface. Thus, it would be further advantageous to integrate a molecular structure

database with molecular visualization and analysis tools, as navigating among the different databases and viewing programs can be time consuming and can be confusing,

as each program may have a different scale or look and feel.

4. Proprietary Data Issues

Recently, techniques such as homology modeling and other structure prediction

methodologies have been used to generate models of enzymes, ligands or target

receptors, such as protein or other macromolecular receptors, whose structures

have not yet been determined experimentally. The databases described herein can

provide access to these structures to researchers and others, such as clinicians or

educators. The databases can be stored on large networks, such as the Internet, which

provide a convenient means of data dissemination, but are most well-known for

providing unrestricted access to data uploaded to Internet servers.

In some cases, however, it might be desirable for the proprietor of a molecular

structure database to carefully control access to their respective collections of molecular

structure data files. For example, such data files may be obtained as a result of

proprietary structure prediction algorithms or methods. In such cases, access to such

information might be limited to only certain structures or families of structures.

Thus, it would be additionally advantageous to provide an integrated system to

permit controlled access to a database of molecular structures and related properties

within a single user interface.

SUMMARY OF THE INVENTION

Described herein is a database and interface for access to 3-D molecular structures

and associated properties, which can be used to facilitate the design of potential new

therapeutics. The interface also provides access to other structure-based drug discovery

tools and to other databases, such as databases of chemical structures, including fine

chemical or combinatorial libraries, for use in structure-focused high-throughput

screening, as well as to a host of public domain databases and bioinformatics sites.

The invention provides a relational database that collects multiple data files

relating to the same molecular structure in the same subdirectory, and provides an

interface to access all of the collected files from the same structure using the same user

interface program. The collected files can comprise a variety of information and

computer file formats, depending on the type of information to be conveyed to users of

the database. In accordance with the invention, a user communicates over a public shared

network, such as the Internet, or over a controlled private network, such as an intranet,

with a secure server that controls access to the collected files, and the interface to the

collected files is provided by a graphical user interface. In this way, the invention

provides a convenient means of searching molecular structure data for characteristics of

interest. The invention also permits data searching, file viewing, and investigation of

multiple representations of molecular structures from within a single viewing program.

In one aspect of the invention, the data files are made available over a wide area

shared network, such as the Internet, and the graphical user interface (GUI) used for

viewing the data files is a standard Internet web browser program, such as the web

browser products by Netscape Communications, Inc. and Microsoft Corporation. Such

browser products readily import and provide views of files having a wide variety of

formats that contain alphanumeric, video, and audio data. In another aspect of the

invention, the GUI is provided with a platform independent programming environment

using client applets, such as Enterprise Java Beans (EJB). A security server, preferably

located between the user browser program at a network client machine and the database, controls access

to the database, which is housed at a file server connected to the security server. Before

a user gains access to the database, the security server checks authorization for the

individual user and then, if appropriate, permits downloading of appropriate data from

the database file server.

In another aspect of the invention, data for a molecular structure is loaded into the

database by specifying the file pathnames for the various data files that contain the

different types of data, including the different molecule views. Using a browser to view

the data files permits various helper applications, called plug-ins, to smoothly and

transparently accept the different file formats and provide views to the user. The various

data files of the database are organized in accordance with the database design when they

are loaded into the database and are managed by a relational database management

program.

Other features and advantages of the present invention should be apparent from

the following description of the preferred embodiment, which illustrates, by way of

example, the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a network system constructed in accordance with

the present invention.

Figure 2 is a block diagram showing the primary functional components of the

database server illustrated in Figure 1.

Figure 3 is a block diagram showing the primary hardware components of the

database server and security server machines illustrated in Figure 1.

Figure 4 is a representation of a Login screen shown to a user at a client machine

display of the network system illustrated in Figure 1.

Figure 5 is a representation of a selection screen shown to a user following proper

login through the display illustrated in Figure 4.

Figure 6 is a representation of a selection screen shown to a user following

selection of a database option from the display illustrated in Figure 5.

Figure 7A and Figure 7B show a representation of a protein families display

screen showing an index of the protein families available for selection from the database

server illustrated in Figure 1.

Figure 8 is a representation of a query submission screen shown to a user

following selection of the query option from the display illustrated in Figure 6.

Figure 9A and Figure 9B show a protein database listing generated in response

to a query submitted from the display illustrated in Figure 8.

Figure 10 is a representation of a query submission screen for a search based on

a particular disease state.

Figure 11 is a representation of a protein information screen such as might be

generated in response to a query or in response to selection of a protein from the protein

name display illustrated in Figure 9.

Figure 12A, 12B, 12C, 12D, 12E, and 12F show protein structural data as stored

in a PDB-format data file, comprising one of the data files stored in the database server

illustrated in Figure 1.

Figure 13 is a representation of a protein Visualization Toolkit screen for a

protein selected from the database server illustrated in Figure 1.

Figure 14 is a representation of a viewing screen displaying a 3-dimensional view

of a protein structure selected from the database server illustrated in Figure 1.

Figure 15 is a representation of the functionality contained in the Measure menu

showing a graphical display of the interatomic distance between two atoms of the protein

molecule shown in Figure 14.

Figure 16 is a representation showing a graphical display of the angle between

three atoms in the protein molecule shown in Figure 14.

Figure 17 is a representation showing a graphical display of the dihedral angle

between four atoms in the protein molecule shown in Figure 14.

Figure 18 is a representation showing a graphical display of the atomic

coordinates for an atom in the protein molecule shown in Figure 14.

Figure 19 is a representation of the Sequence Viewer window showing the amino

acid sequence of the protein molecule shown in Figure 14.

Figure 20 is a representation of the Sequence Alignment window showing the

alignment of the amino acid sequence of the protein molecule shown in Figure 14 with

a template sequence.

Figure 21 is a representation of the Secondary Structure prediction window

showing the amino acid sequence of the protein molecule shown in Figure 14 and the

predicted secondary structural features of the protein.

Figure 22 is a representation of the Visualization menu selections for the protein

molecule shown in Figure 14.

Figure 23 is a representation of the Secondary Structure Ribbon menu selections

for the protein molecule shown in Figure 14.

Figure 24 is a representation of the Quality menu selections and shows a

Ramachandran plot for the protein molecule shown in Figure 14.

Figure 25 is a representation of the Quality menu selections and shows a

Balasubramanian plot for the protein molecule shown in Figure 14.

Figure 26 is a representation of the Quality menu selections and illustrates the

Ellipsoid functionality for the protein molecule shown in Figure 14.

Figure 27 is a representation of the Surface Hydrophobicity menu selection for

the protein molecule shown in Figure 14.

Figure 28 is a representation of the Strain Plot menu selection for the protein

molecule shown in Figure 14.

Figure 29 is a representation of the Profile Analysis menu selection for the protein

molecule shown in Figure 14.

Figure 30 is a representation of the Align functionality showing a list of proteins

to select for superimposing on the protein molecule shown in Figure 14 and a window

for displaying the superposition of multiple proteins.

Figure 31A and Figure 31B show a display screen for entering data about a

protein into the database server illustrated in Figure 1.

Figure 32 is a flow diagram representation of operations performed when a user

accesses the protein database of the system illustrated in Figure 1.

Figure 33 is a flow diagram representation of the operations performed during the

security checking operation box of Figure 32.

Figure 34 is a representation of the database schema used by the relational

database management system illustrated in Figure 1.

Figure 35 shows the object classes that produce the screen displays from the

browser program at a user terminal illustrated in Figure 1.

Figure 36 is a block diagram representation of an alternative network system

constructed in accordance with the present invention, using a distributed architecture and

"Enterprise Java Beans" components.

Figure 37 is a block diagram representation of the "Enterprise Java Beans"

components in the Figure 36 system.

Figure 38 is a representation of an Application screen shown to a user at a client

machine display of the network system illustrated in Figure 36, with a Family Tree panel

shown at the left side of the application window.

Figure 39 is a representation of the Application screen shown to a user at a client

machine of the network system illustrated in Figure 36, illustrating a Search panel shown

in the application window.

Figure 40 is a representation of the Application screen shown to a user at a client

machine of the network system illustrated in Figure 36, illustrating a Data Mining panel

shown at the left side of the application window with a Structure visualization panel at

the right side of the application window.

Figure 41 is a representation of the Application screen shown to a user at a client

machine of the network system illustrated in Figure 36, illustrating simultaneous Quality

display features for two selected proteins in the Visualization panel of the application

window.

Figure 42 is a representation of the Application screen shown to a user at a client

machine of the network system illustrated in Figure 36, illustrating simultaneous

Structure display features for two selected proteins in the Visualization panel of the

application window.

Figure 43 is a representation of the Application screen of Figure 42, showing the

drop down menu for selection of Display features.

Figure 44 is a representation of the Application screen of Figure 42, showing the

drop down menu for selection of Options for atoms to be viewed.

Figure 45 is a representation of the Application screen shown to a user at a client

machine of the network system illustrated in Figure 36, illustrating simultaneous Sequence display features for two selected proteins in the Visualization panel of the

application window.

Figure 46 is a representation of the database objects for the database design of the

system illustrated in Figure 36.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention will be better understood with reference to the attached drawings,

in which like reference numerals refer to like objects. For purposes of illustration, the

system is described with respect to a database of protein structures. It should be

understood that the system could also be adapted for use with databases of 3-D structures

of other biological molecules or macromolecules, such as DNA, RNA or carbohydrates.

1. System Components

Figure 1 is a block diagram of a network system 100 constructed in accordance

with the present invention. The system includes a database server 102 that communicates

over a high speed communications line 104 with a security server 106. Multiple user

client machines 108, 110, 112 are shown in communication with the security server 106

over high speed network communications lines 114, 116, 118, respectively, to gain access

to data stored at the database server 102. The database server stores protein data in a

relational protein database that can be searched by protein family and by a variety of

protein characteristics, so that files relating to the same protein are stored in the same

subdirectory. The files comprise a wide variety of protein data, and can include PDB-format

files, text files, graphic image files, virtual reality files, video files, and audio files. Each of the client machines 108, 110, 112 comprises a network terminal that permits a

user to gain access and retrieve data using a resident browser on their respective client

machine. The browsers provide a graphical user interface to access all of the collected

database files and view the data without changing the interface.
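The arrangement in which all files relating to one protein share one subdirectory can be sketched in a few lines. The Python example below is a hedged illustration only. The directory layout (one subdirectory per protein identifier), the mapping from file extension to file type, and the names `KIND_BY_SUFFIX`, `collect_protein_files`, and `demo` are assumptions made for the example and are not taken from the described system.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping from file extension to the kind of data it holds.
KIND_BY_SUFFIX = {
    ".pdb": "structure",
    ".txt": "text",
    ".gif": "image",
    ".wrl": "virtual reality",
    ".mov": "video",
    ".wav": "audio",
}


def collect_protein_files(root: Path, protein_id: str) -> Dict[str, List[str]]:
    """Group the files in one protein's subdirectory by kind of data."""
    grouped: Dict[str, List[str]] = {}
    for path in sorted((root / protein_id).iterdir()):
        if path.is_file():
            kind = KIND_BY_SUFFIX.get(path.suffix.lower(), "other")
            grouped.setdefault(kind, []).append(path.name)
    return grouped


def demo() -> Dict[str, List[str]]:
    """Create a throwaway directory tree and collect one protein's files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "P0001").mkdir()
        for name in ("model.pdb", "notes.txt", "ribbon.gif"):
            (root / "P0001" / name).write_text("placeholder")
        return collect_protein_files(root, "P0001")


result = demo()
```

Under these assumptions, a single directory listing yields every representation of the protein, which a front end can then hand to the appropriate viewer for each file type.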

In accordance with the invention, a user at one of the terminals 108, 110, 112

communicates over a public network connection 114, 116, 118 after receiving

authorization from the security file server 106, which controls access to the collected

files. Access to the security server and collected files at the database server 102 is

provided through the graphical user interface of the conventional browser program that

is widely available, such as the Communicator or Navigator browser product from

Netscape Communications Corp. or the Internet Explorer browser product from

Microsoft Corporation. In this way, the invention provides a convenient means of

searching among protein families for proteins with characteristics of interest. The

invention also permits data searching, file viewing, and investigation of multiple views

of proteins from within a single viewing program.

Each of the network terminals 108, 110, 112 comprises a computing platform that

typically includes a central processor unit (CPU), such as the Pentium-class of

microprocessor chips manufactured by Intel Corporation or the CPU microprocessors

manufactured by Silicon Graphics, Inc. The terminals also include a video display unit,

such as a video monitor or other display screen, and a network interface. In the preferred

embodiment, each network terminal also includes another means of accessing data on

storage media, such as a floppy disk drive or a CD-ROM or DVD-ROM drive. Because each network terminal 108, 110, 112 communicates with the security server 106 using

the same type of graphical user interface, access to the security server does not depend

on the operating system of each terminal being the same. For example, each of the

illustrated terminals 108, 110, 112 may use a different operating system. Thus, the first

client machine 108 may function using the "Windows 95" operating system program

from Microsoft Corporation. The second client machine 110 may function using the

Unix operating system, and the third client machine 112 may operate using the Apple

Macintosh operating system. Access to the security server will be transparent to the

users at these client machines 108, 110, 112 as long as the browser at each terminal is

operable with the respective client operating system.

As noted above, the client network terminals 108, 110, 112 communicate through

a web browser interface to the security server 106 and then gain access to the database

at the database server 102. Figure 2 is a block diagram of the database server 102. The

database server is a computing machine that supports a relational database system 202,

such as the systems manufactured by Oracle Development Corporation, which accesses

the collection of data files 204 comprising the protein database of the system 100. That

is, the relational database system 202 provides access, security control, and search and

retrieval services for the database files 204 that are stored at the database server 102. The

database files may include a wide variety of file types, including alphanumeric text files,

graphic image files, video files, and audio files.

The database server 102 also includes a web application server 206, which

interfaces with the relational data base system 202 to retrieve and, if necessary, process the data files 204. The web application server may include programs known as

cartridges. Those skilled in the art will be familiar with cartridges and with the tasks they

can perform, and will understand that cartridges specific to particular database systems

can be readily developed by users of the database system 202. As described further

5 below, for example, cartridges are used to control access to the database, so that data

subscribers of different authorizations are provided with different access levels to the

files, or will be granted access with different exceptions. The database server 102 also

includes a controller 208, which provides the operating system at the server. The

operating system in the preferred embodiment supports various program types, including

those written in the "Java" programming language developed by Sun Microsystems, Inc.

of Palo Alto, California, USA.

Figure 3 is a block diagram of the primary hardware components of the computers

that comprise both the database server 102 and the security server 106. It should be

understood that the functions of the database server and the security server can be

implemented on the same computer, if desired. In the preferred embodiment, the

functions are performed at separate computers to facilitate database maintenance at the

database server and increase the rate at which new data can be uploaded and made

available at the database. In addition, the client network terminals 108, 110, 112 may

each have a construction similar to that shown in Figure 3. As noted above, each client

machine 108, 110, 112 may function with a different operating system, as may the

servers 102, 106, the only requirement being that they can communicate with each other

over the network connections 104, 114, 116, 118. The exemplary computer 300 includes a CPU 302 that provides substantial computing power to successfully handle a

large volume of user traffic and to manage transfer of large amounts of data. For

example, when a user requests the database files for a single protein, the amount of data

transferred from the database server 102 can comprise more than 2 megabytes (MB) of

data. The computer 300 also includes memory 304, which typically includes 64 MB or

more of fast random access memory (RAM). The computer will also include a network

interface 306 to permit high speed communications with a network 308, such as provided

by direct connection with T1 data lines, a frame relay connection, or other high

bandwidth digital data line. The computer also will include an external storage device

interface 310 to permit transfer of data, including programming, from external storage

media 312, such as floppy disks, magnetic tape, CD-ROM, and DVD-ROM storage

media. Persistent data too large to fit within the memory 304

may be stored internally in disk storage 314. Presently, readily available disk storage provides

space of six to eight gigabytes (GB).

A user will initiate action and input data from a combination of input devices 320,

such as a keyboard and display mouse, and will view data on a display 322 such as a

video monitor. Typically, the database server 102 and security server 106 will be more

powerful machines than the network terminals 108, 110, 112, because the servers must

potentially handle a large amount of communications traffic. The network terminals

must be able to communicate with the security server and receive large amounts of data,

but the required speed of such data transfer is to be determined by the user preference.

It has been found, however, that a network connection of greater than 56K/sec data transfer is most desired, and slower terminal-to-security server communications will not

be satisfactory to most users.
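The preference for a faster connection follows from simple arithmetic, sketched below in Python. The sketch assumes that "56K/sec" denotes 56 kilobits per second, that one megabyte is 2^20 bytes, and that protocol overhead is ignored; the function name `transfer_seconds` is hypothetical. The T1 figure of 1.544 megabits per second is the standard line rate.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Idealized transfer time: payload bits divided by link rate."""
    return size_bytes * 8 / link_bits_per_second


TWO_MB = 2 * 1024 * 1024  # data for a single protein, per the text above

seconds_56k = transfer_seconds(TWO_MB, 56_000)    # roughly five minutes
seconds_t1 = transfer_seconds(TWO_MB, 1_544_000)  # roughly eleven seconds
```

Under these assumptions, retrieving the files for one protein takes about five minutes over a 56 kilobit-per-second link, compared with about eleven seconds over a T1 line.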

2. Using the System to Retrieve Data

The first step in gaining access to the stored database and the data files contained

therein is to obtain access to the database server 102. To gain access to the server, a user

must first receive access through the security server 106 by standard conventional

communication techniques. For example, the Internet protocol (IP) address of the

security server must be known, so that a user 108, 110, 112 can communicate with the

server 106 through the user's browser program. Upon communicating with the security

server 106, the user is presented with the first login screen at the browser interface.

Figure 4 is a representation of a login screen 402 that is shown to a user 108, 110, 112

(Figure 1) at a client machine display unit, viewing through the user's conventional

browser window. The login screen identifies the database to be accessed, and provides

a user with a "login" request button 404 to be selected (clicked) using the client machine

display mouse, keyboard, or other window designation device. No program response will

occur until the login button is selected.

After a user clicks on the login button 404, the security server will provide the

user's browser with a display screen that shows a list of database selection features, or

utilities, from which one must be chosen. Figure 5 is a representation of the information

box screen 502 that is shown to a user following selection of the login button in the

screen display illustrated in Figure 4. In the Figure 5 representation, two of the illustrated

database selections are shown for variety, but do not form any part of the invention; these are the selections for "SVdBase" 504 and the "CombiLib" 506. Other permitted database

selections may be included. The other two database selections shown in Figure 5, for

"SBdBase" 508 and for "SBdBasePlus" 510, are used to select access to the unique

relational database of protein structural data provided in accordance with the present

invention. The two choices 508, 510 select different levels of access to the database files.

The different levels of access may be, for example, to different protein families, so that

certain protein families may be made available only at a higher fee. The different levels

of access also may restrict access to files within a given protein family, so that certain

types of data may be made available only at a higher fee.

After a user selects one of the database access levels in Figure 5, the security

server will present the user with a database selection screen. Figure 6 is a representation

of the database selection screen 602 shown to a user following selection of the database

option from the display illustrated in Figure 5, and shows that a user can select between

an index view display button 604 and a query display button 606. If the index view

button 604 is selected by a user, then the security server will return a list of protein

families that are available for viewing. In the preferred embodiment, all protein families

are shown in the index view, according to protein family name, even if the user does not

have authorization to view them. In this way, users will be kept informed of the full

database that is potentially available, including recently added data for which they may

not have access. Thus, users may decide to upgrade their level of access upon viewing

the complete list and determining the available proteins to which they do not currently

have access rights. Figure 7A and Figure 7B comprise a display screen list 702 that shows an exemplary index listing of the available proteins displayed by selecting the

index button 604. If the user selects the query display button 606 from the Figure 6

display, then the security server will return a query screen so the user can designate a

query on the database.

Figure 8 is a representation of a database query display screen 802 shown to a

user following selection of query submission from the display screen illustrated in Figure

6. The query screen 802 shows a display area 804 that lists query fields over which

searching can be performed. In the preferred embodiment, these fields include a database

identification (ID) number, protein name, species, gene, conventional "SwissProt"

accession number, disease state, protein function, protein family, and full-text search

string. After the user has determined the search query, the user selects the "Submit

Query" display button 806 and the user's browser sends the query information to the

security server.

Figure 9A and Figure 9B illustrate a query results display screen 902 that is

shown to a user following selection of the query option from the display illustrated in

Figure 8, where a user has selected the "Protein Name" field for search. Thus, the Figure

9A and 9B display shows a list of protein names, corresponding to the names of proteins

available in the database at the database server 102 (Figure 1). That is, the left-most

column 904 in Figure 9A, 9B lists the protein name, the next column 906 shows the

corresponding database ID number, the next column 908 contains an indication (where

applicable) of whether PDB data is available, and the last column 910 contains an indication (where applicable) of whether homology data is available for the named

protein.

Figure 10 is a representation of the query display screen first illustrated in Figure

8, except that the Figure 10 display screen 1002 shows that "cancer" has been inserted

into the "Disease" state field 1004 so that a query is executed that will search for proteins

related to cancer. More particularly, when the "Submit Query" display button 1006 is

selected in Figure 10, the security server will receive the "cancer" disease query and will

cause a search to be executed over the database files by the database server 202 (Figure

2). The search results will be returned by the security server to the user's browser.

Figure 11 is a representation of a protein information screen 1102, such as might

be generated in response to a query or in response to selection of a protein from the

protein name display illustrated in Figure 9. The Figure 11 screen shows that, in the

preferred embodiment, the protein information includes the database ID number 1104,

the complete protein name 1106, the species 1108 for the protein data file, the gene type

1110, predetermined database keywords 1112 for the protein, disease information 1114,

function information 1116, protein family class 1118, the SwissProt accession number

1120, and the EC number 1122. The data contained in these fields will be familiar to, or

readily understood by, those skilled in the art. A row of display selection buttons below

the information fields of Figure 11 shows that a user may select to retrieve the

corresponding data file for the protein from the PDB-format database 1130, or may select

the SwissProt data 1132 for the protein, or may select "GenBank" data 1134, or may

select "PIR" data 1136 for the protein. Of these alternatives, it should be noted that only the PDB-format database contains multiple data types, other than protein sequence data.

That is, the SwissProt, GenBank, and PIR databases contain only sequence data and not

molecular structure data. The user may decide to begin another query by selecting the

"New Query" display button 1138.

3. The Protein Visual Display Tools

As noted above, protein structure data is alphanumeric data that can be searched

using text strings, but is not meaningful to researchers without the aid of viewing

programs. Figure 12A, 12B, 12C, 12D, 12E, 12F (collectively referred to as "Figure 12")

is a representation of protein structure data as stored in a PDB-format data file,

comprising one of the data files stored in the database server illustrated in Figure 2. The

standard PDB format of the protein data shown in Figure 12 will be familiar to those

skilled in the art. It should be apparent that such data is tedious to review and is not

especially revealing of protein structures and characteristics. The database system of the

present invention organizes such data into a coherent database representation and permits

a visual interface to such data through a conventional Internet web browser interface.
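
As a minimal sketch of why PDB-format data is text-searchable yet tedious to read, the fixed-column ATOM records of such a file might be parsed as follows. The names `Atom` and `parse_atom_line`, and the sample coordinates, are illustrative assumptions and not part of the described system; the column positions follow the published PDB file format.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record of a PDB-format file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up record for the alpha carbon of lysine 224, built column by column.
SAMPLE = "".join([
    "ATOM".ljust(6),   # record name
    f"{2:>5d}",        # atom serial number
    " ",
    "CA".center(4),    # atom name
    " ",               # alternate location indicator
    "LYS",             # residue name
    " ",
    "A",               # chain identifier
    f"{224:>4d}",      # residue sequence number
    " " * 4,           # insertion code plus three unused columns
    f"{11.104:8.3f}",
    f"{6.134:8.3f}",
    f"{-6.504:8.3f}",
])

atom = parse_atom_line(SAMPLE)
```

A viewer would apply such a parser to every ATOM line to obtain the coordinates it renders; the raw text itself conveys little about the fold.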

Figure 13 is a representation of a protein "Visualization Toolkit" display screen

1302 for a protein selected from the database server 202 (Figure 2) using the interface of

the present invention. In response to the user selecting a protein name from the

"database" button of Figure 11, the security server causes the visualization toolkit

interface program to be launched and displayed within the user's browser, and returns the

Figure 13 display page as the opening page of the visualization toolkit interface program.

Thus, a user will always be provided with the display of Figure 11 after selecting a protein through either the protein name displays or from executing a query, so that

selection of the visualization toolkit may be made for the desired protein.

Figure 14 is a representation of a viewing screen displaying a 3-dimensional view

of a protein structure selected from the database server illustrated in Figure 1. When a

structure is read into the viewer, the atoms are displayed and can be colored according

to atom type. The structure can be manipulated within the 3-D graphics window using

the computer mouse to control its movement. Across the top of the screen are six

pulldown menu selections under which are commands for structure visualization and

analysis. In some cases, commands can only be executed if precalculated information

is stored in data files associated with a particular protein in the database. If such

information is not available for a given protein, the command that requires the

information will be blanked out and will not be accessible for that protein. Shown at the

right side of the screen is the atom identifier information for any atom selected in the

structure (e.g. K 224, CA), as well as commands for clearing any screen annotations,

downloading coordinates for additional structures, for accessing the database index and

for submitting a new database query.

Figure 15 is a representation of the functionality contained in the Measure menu.

The figure shows a graphical display of the interatomic distance between two atoms in

the protein, which is created by selecting the Distance command and picking the two

atoms using the cursor and mouse. The Angle (Figure 16) and Dihedral (Figure 17)

commands calculate and display angle values when the user picks either three or four consecutive or non-consecutive atoms in the structure. The Coordinate (Figure 18)

command displays the atomic coordinates for a selected atom.
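
The Distance, Angle, and Dihedral measurements described above can be sketched from atomic coordinates alone. The following is a minimal illustration, not the toolkit's own code; the function names are assumptions, and the dihedral sign convention used here (the common atan2 formulation) is likewise assumed rather than taken from the source.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a: Vec) -> float:
    return math.sqrt(_dot(a, a))


def distance(a: Vec, b: Vec) -> float:
    """Interatomic distance between two picked atoms."""
    return _norm(_sub(a, b))


def angle(a: Vec, b: Vec, c: Vec) -> float:
    """Angle in degrees at atom b, formed by atoms a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral(a: Vec, b: Vec, c: Vec, d: Vec) -> float:
    """Signed torsion angle in degrees about the b-c axis for atoms a-b-c-d."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / _norm(b2)
    return math.degrees(math.atan2(y, x))
```

Because the functions take plain coordinate triples, the picked atoms need not be consecutive in the chain, which matches the behavior described for the Angle and Dihedral commands.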

The View pulldown menu includes functionality for displaying selective atoms

in the protein, for example, all atoms or only main chain or alpha carbon atoms.

The Sequence menu includes commands for amino acid sequence analysis. The

Sequence command brings up a separate window in which the protein sequence is

displayed according to single letter amino acid codes and colored according to residue

type. Figure 19 shows a representation of the Sequence Viewer window showing the

amino acid sequence of the protein molecule shown in Figure 14. The sequence and

structure windows are interactive in that placement of the cursor on an amino acid code

in the sequence highlights the position of that amino acid in the structure. The sequence

window is independent of the main molecular structure window in that it can be moved,

resized and closed as a separate frame. These capabilities are due to functionality in the

"Java" language and are applicable among all windows present in the system.

The Alignment functionality (Figure 20) shows the alignment of the protein

molecule with one or more template sequences. The sequence alignment is calculated

using any one of a number of sequence alignment algorithms known to those of skill in

the art, for example, programs such as MSA (Carrillo and Lipman, "The multiple

sequence alignment problem in Biology", SIAM J. Appl. Math. 48, 1073-1082 (1988);

Altschul and Lipman, "Trees, stars and multiple biological sequence alignment", SIAM

J. Appl. Math., 49, 197-209 (1989); Altschul, "Gap costs for multiple sequence

alignment", J. Theor. Biol., 138, 297-309 (1989); Altschul et al, "Weights for data related by a tree", J. Molec. Biol, 207, 647-653 (1989); Altschul, "Leaf pairs and tree

dissections", SIAM J. Discrete Math., 2, 293-299 (1989); Lipman et al, "A tool for

multiple sequence alignment", Proc. Natl. Acad. Sci. USA, 86, 4412-4415 (1989)) or

ClustalW (Higgins et al, CABIOS, 8, 189-191 (1991)), which are available in the public

domain.

The Secondary Structure command (Figure 21) opens an additional window in

which is displayed information about the predicted secondary structure of the protein, for

example, whether a particular residue is involved in a helix, coil or sheet. This

information is precalculated and stored in the database, for example, by using a publicly

available algorithm for calculating secondary structure, such as SSPAL (Salamov and

Solovyev, "Prediction of protein secondary structure by combining nearest-neighbor

algorithms and multiple sequence alignments", J. Mol. Biol. 247, 11-15 (1995)) or can

be calculated interactively.

molecule shown in Figure 14. This menu contains functionality for visualizing different

structural attributes of the displayed molecule. The Hydrophobicity command

interactively colors the residues of the protein according to a color scheme based on a

predetermined hydrophobicity scale for the individual amino acid residues. This allows

identification of hydrophobic and hydrophilic regions in the protein structure. The

Active Sites, Glycosylation, Phosphorylation and Natural Variants commands display

these sites in a protein molecule, if known, in accordance with data which has been

previously determined and stored in association with the particular protein. The Secondary Structure Ribbon command (Figure 23) opens a window in which

is displayed a solid ribbon along the protein backbone highlighting the secondary

structural attributes of the protein, for example, helices, sheets or coils. The ribbon is

precalculated, such as by using the publicly available Ribbons program (Carson and

Bugg, "Algorithm for ribbon models of proteins", J. Mol. Graphics, 4, 121-122 (1986);

Carson, "Ribbon models of macromolecules", J. Mol. Graphics, 5, 103-106 (1987);

Carson, "Ribbons 2.0", J. Appl. Cryst, 24, 958-961 (1991); Carson, "Ribbons", Methods

in Enzymology, R.M. Sweet and C.W. Carter, eds, Academic Press, 277, 493-505

(1997)) and is stored in the database in a data file associated with a given protein. The

solid surface is displayed in the graphics window, for example, by plugging in a utility

such as the SGI Cosmo Player to Netscape. Also precalculated and stored along with the

protein are electrostatic and dynamic surfaces, which are called up and displayed as solid

surfaces in a separate window using the Electrostatic Surface and Dynamic Surface

commands, respectively. The electrostatic surface is precalculated using an available

program, such as GRASP (Nicholls et al, PROTEINS, Structure, Function and Genetics,

Vol. 11, No. 4, pg 281 ff (1991)), for calculating electrostatic potential. The dynamic

surface is calculated from molecular dynamics data, which is calculated by using any

number of molecular dynamics software packages. Such packages are well known to

those of skill in the art. The Dynamic Surface command color codes the residues in the

protein according to movement or flexibility during molecular dynamics simulation. For

example, "hot" residues move the most and are colored red; residues that move over average trajectories are colored green; and the "cool" residues that exhibit the least

movement are colored blue.
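
The red, green, and blue coding of the Dynamic Surface command can be illustrated with a small sketch. The source does not state how "hot", average, and "cool" residues are separated, so the cutoff used here (half a standard deviation on either side of the mean fluctuation) is purely an assumption, as are the function name `color_by_mobility` and the sample values.

```python
from statistics import mean, pstdev
from typing import Dict


def color_by_mobility(fluctuation: Dict[str, float], band: float = 0.5) -> Dict[str, str]:
    """Map per-residue fluctuation values to red/green/blue.

    `band` is an assumed cutoff, in standard deviations from the mean;
    the described system does not specify its thresholds.
    """
    mu = mean(fluctuation.values())
    sigma = pstdev(fluctuation.values())
    colors = {}
    for residue, value in fluctuation.items():
        if value > mu + band * sigma:
            colors[residue] = "red"     # "hot": moves the most
        elif value < mu - band * sigma:
            colors[residue] = "blue"    # "cool": moves the least
        else:
            colors[residue] = "green"   # near-average movement
    return colors


# Made-up fluctuation values, e.g. from a molecular dynamics trajectory.
example = color_by_mobility({"K224": 2.0, "A225": 1.0, "G226": 0.0})
```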

Figures 24-29 are representations of the Quality menu selections. The commands

in the Quality menu are used to evaluate the quality of a model structure by validating

that the structural and energetic characteristics of the molecule are within a reasonable

expected range of values. Figure 24 shows a Ramachandran plot for the protein molecule

shown in Figure 14. The plot is calculated interactively when the command is executed

and displays the phi and psi angle values for the protein structure in a separate window.

Figure 25 is a representation of the Quality menu selections and shows a

Balasubramanian plot for the protein molecule shown in Figure 14. The

Balasubramanian plot is also calculated interactively for the displayed protein and

indicates a directional arrow between the phi and psi angle values for each residue in the

protein. Each residue is color coded according to secondary structure based on the

dihedral angle values, for example, residues involved in helices are colored red and those

in beta sheets are colored blue.

Figure 26 is a representation of the Ellipsoid functionality for the protein molecule shown in Figure 14. The ellipsoid

displays the total extent or size of the protein in the x-, y- and z-directions.
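
One simple way to obtain such extents is to take the coordinate range along each axis. The sketch below assumes an axis-aligned ellipsoid whose semi-axes are half of each range; the source does not describe the actual calculation, and the function name `axis_aligned_ellipsoid` is hypothetical.

```python
from typing import Iterable, Tuple

Vec = Tuple[float, float, float]


def axis_aligned_ellipsoid(coords: Iterable[Vec]) -> Tuple[Vec, Vec]:
    """Return (center, semi_axes) spanning the coordinates along x, y and z."""
    xs, ys, zs = zip(*coords)
    lows = (min(xs), min(ys), min(zs))
    highs = (max(xs), max(ys), max(zs))
    center = tuple((lo + hi) / 2 for lo, hi in zip(lows, highs))
    semi_axes = tuple((hi - lo) / 2 for lo, hi in zip(lows, highs))
    return center, semi_axes


# Three made-up atom positions.
center, semi_axes = axis_aligned_ellipsoid([(0, 0, 0), (4, 2, 0), (2, 6, 8)])
```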

Figure 27 is a representation of the Surrounding Hydrophobicity for the protein

molecule shown in Figure 14 as calculated using the Surface Hydrophobicity command.

The hydrophobic packing for the protein is precalculated based on how far a residue is

from the protein surface (Ponnuswamy and Prabhakaran, "Properties of nucleation sites in globular proteins", Biochem. and Biophys. Res. Comm., 97, 1582-1590 (1980);

Ponnuswamy et al, "Hydrophobic packing and spatial arrangement of amino acid

residues in globular proteins", Biochim. Biophys. Acta, 623, 301-316 (1980)), and the

result is plotted in a separate window. The interior residues are displayed in the center

of the plot in the area of highest hydrophobicity and the surface residues are shown at the

edge of the plot, indicating areas of lowest hydrophobicity. This plot can give

information on nucleation sites and can be compared to crystal structure data.

The Strain Plot and Local Strain Energy commands display plots of internal strain

energy in the molecule. The strain plot (Figure 28) is displayed in a separate window as

a graph of strain energy per residue. The local strain energy is displayed in a separate

window as a solid ribbon along the protein backbone which is colored according to strain

energy (for example, highly strained residues are colored red and unstrained residues are

colored blue). Strain energies are precalculated by using external programs, such as

ICM, and stored in the database in association with the protein structure. The Profile

Analysis command (Figure 29) displays a plot of packing factor per residue, which is

precalculated using a publicly available program such as WHAT IF (Vriend, "WHAT IF:

a molecular modeling and drug design program", J. Mol. Graph. 8, 52 (1990)) and stored

in a data file in the database. The profile analysis is displayed in a separate window as

a graph of packing factor per residue.

Figure 30 is a representation of the Align functionality. The Align command

aligns proteins within subfamilies. A list of proteins within the subfamily is displayed

in a viewer window and can be selected for superimposing on the protein molecule shown in the viewer window. A separate window displays the superposition of the

selected proteins.

4. Entering Protein Data Into the System Database

Figure 31 is a representation of a display screen for entering data about a protein

into the database server illustrated in Figure 1. As noted above, the database design of

the system illustrated in Figure 1 collects files of different types, such as text, graphic,

video, virtual reality, and audio. The database design further collects data files and places

them in a relational organization, so that files that are related to the same protein family

are readily accessible through a search routine. In accordance with known relational

database management programs, such searching may be carried out with specialized

search languages, such as Structured Query Language (SQL). In the preferred

embodiment, data is entered into the database by supplying pathnames to data files that

are loaded into the data storage of the database server 102 (Figure 1). The template for

entering such pathname information is supplied by the entry display screen 3102 of

Figure 31. The display screen includes fields that accept input that is manually entered

by an authorized user, as well as data file names to identify protein data files that will

become interrelated by the database management system.

As indicated in Figure 31, the first data field 3104 is for the protein family name,

followed by the protein name 3106. This information is manually entered by a user who

gains access through the security server or other special access means at the database

server. When the information is received by the database management program at the

database server, it is automatically stored according to the database design. The next information that is manually entered is for the species name 3108. Following the species

name, the user who is entering the database information must supply data file names.

The file names will comprise a data file pathname to a data file stored at the database

server.

The first data file pathname 3110 is to an annotation file, which is a text

(alphanumeric) file that contains predetermined supplementary protein data, such as

disease information, ID numbers, and the like. Those skilled in the art will understand

that a text file has a standardized file extension of ".txt" in the filename. The annotation

file provides a means of collecting the supplementary protein data in one convenient file,

and may be created by a user manually entering the necessary data, or may be the result

of processing other files or data, such as the protein PDB data file.

The next file pathname to be entered into the database is for an alignment text file

3112 that contains atomic alignment data. The next pathname is to a secondary structure

file 3114, which is another text file. A ribbon setup file pathname 3116 and a ribbon data

file pathname 3118 are next entered. These data files comprise virtual reality files, which

have a standardized file extension of ".wrl" and which are viewed in a web browser

program using any one of a variety of readily available plug-in applications. Those

skilled in the art will appreciate that a standard file extension such as ".wrl" is recognized

by operating systems and browsers to automatically trigger launch of a program that can

view the data. For example, one ".wrl" viewer program that integrates with conventional

Internet browsers such as Netscape Navigator is the "Cosmo Player" by Platinum

Technology, Inc. for playing virtual reality files that are coded in accordance with the VRML 2.0 open standard for Internet programming. The next entered file pathname is

to a natural variant coordinate file 3120, which is a text file.

The next data pathname to be entered is for an electrostatic surface file 3122,

which is a virtual reality ".wrl" file. Those skilled in the art will understand the type of

molecule view that corresponds to an electrostatic surface. The next file is a text file that

contains surface hydrophobicity data 3124. Next is a profile analysis data file 3126,

which contains atomic coordinate data in the same format as a PDB file, but which is

created with homology model techniques. Next is a protein ellipsoid data file 3128, a

text file. An accessibility data file 3130 is next, comprising a text file.

A local strain data file 3132 is the next file pathname, and is a text file.

Thereafter, the next two files relate to local strain data. The first is a local strain ribbon

data ".wrl" virtual reality file 3134, and the second is a local strain ribbon line data ".wrl"

virtual reality file 3136. The next pathname is to another ".wrl" file, a dynamic surface

file 3138. The dynamic surface file provides a view of the molecular outer surface of the

protein.

The next items for database input relate to the initial model 3140. The first entry

is the pathname of a text file comprising an initial coordinates file 3142. The next three

items shown in Figure 31 are optional, in that the information they specify can be

extracted in real-time from other data files by the visualization toolkit interface program.

If the visualization toolkit interface program has the ability to extract such data in

real-time, then the optional file pathnames shown in Figure 31 need not be supplied. These

optional files comprise a Ramachandran plot 3144, bond length graph 3146, and bond angle graph 3148. Those skilled in the art will appreciate that these three types of views

can be easily generated in real-time, without any need for pre-processing. The last data

entry relating to the initial model is for the depositor name 3150. This field identifies the

person who is entering the data, and is useful for identifying persons who should be able

to answer questions about the particular data files they entered.

It is possible to review the atomic coordinate data for a protein and, after careful

consideration by expert review, it is possible to improve the molecular representation by

modifying the coordinate data. Therefore, the preferred embodiment of the system

accepts multiple levels of data refinement, as indicated in Figure 31 by the "Refinement

Level 1" 3160 and "Refinement Level 2" 3162 entry fields. Thus, Level 1 includes a

field for the pathname of the Level 1 coordinates file 3164, optional fields for Level 1

refinement Ramachandran plot pathname 3166, Level 1 refinement bond length graph

pathname 3168, and Level 1 refinement bond angle graph pathname 3170, as well as the

mandatory field for Level 1 refinement depositor name 3172. Similarly, the Level 2

fields 3162 include a field for the pathname of the Level 2 coordinates file 3174, optional

fields for Level 2 refinement Ramachandran plot pathname 3176, Level 2 refinement

bond length graph pathname 3178, and Level 2 refinement bond angle graph pathname

3180, as well as the mandatory field for Level 2 refinement depositor name 3182.
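
The per-level entry fields described above (a coordinates file, three optional graph pathnames, and a mandatory depositor name) can be pictured as a small record type. This is a sketch only: the class name `ModelLevel`, its attribute names, and the example pathname are hypothetical, and treating the coordinates pathname as required is an assumption not stated in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelLevel:
    """One model level (initial, Refinement Level 1, or Refinement Level 2)."""
    coordinates_path: str                     # assumed required for a level
    depositor: str                            # mandatory per the description
    ramachandran_path: Optional[str] = None   # optional: can be derived in real time
    bond_length_path: Optional[str] = None    # optional
    bond_angle_path: Optional[str] = None     # optional

    def missing_fields(self) -> List[str]:
        """Names of required fields that were left blank on the entry form."""
        missing = []
        if not self.coordinates_path.strip():
            missing.append("coordinates_path")
        if not self.depositor.strip():
            missing.append("depositor")
        return missing


# Hypothetical Level 1 entry submitted without a depositor name.
level1 = ModelLevel(coordinates_path="/data/example/level1.pdb", depositor="")
```

Leaving the three graph pathnames as `None` reflects the case where the visualization toolkit generates those views itself.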

Finally, at the bottom of the data entry display of Figure 31, the entered data can

be submitted by clicking the "Submit" display button 3190. The entered information can

be cleared by selecting the "Clear" button 3192.

5. System Operation

Figure 32 is a flow diagram representation of operations performed when a user

accesses the protein database of the system illustrated in Figure 1. The first operation is

for the security server to perform a security check to ensure that the user has

authorization to proceed with access. This operation is represented by the Figure 32 flow

diagram box numbered 3202. As noted above, different users may be granted different

levels of access. This is described further below. After a user has been granted access,

the next operation is for the user to select a database for viewing. The request must be

authorized by the security server. This operation is represented by the flow diagram box

numbered 3204, and the screen display being viewed by the user corresponds to Figure

5.

Next, the user selects either a search query or an index list of protein names

available in the database. At this point, represented by the flow diagram box numbered

3206, the user is viewing a screen display like that of Figure 6. Selection of the index list

by a user will cause the user's network terminal to initiate a request to the database server

to execute a corresponding database server cartridge program to create and display the

index list of available protein names. In the preferred embodiment, the entire list of

available SBdBase protein names is shown to all users, regardless of authorization level.

The protein names available to the user may be indicated by special formatting, such as

boldface or special characters. In this way, the user can be apprised of new database

entries and possibly interesting protein names that are unavailable. This can assist in migrating users from a lower level of authorization to a higher level of authorization for

viewing the proprietary database.
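
The index-list behavior described above, in which every protein name is listed but the accessible ones are specially marked, might be sketched as follows. The function name `render_index` and the use of a leading asterisk in place of boldface are assumptions made for illustration.

```python
from typing import Iterable, List, Set


def render_index(all_names: Iterable[str], accessible: Set[str]) -> List[str]:
    """List every protein name, marking those the user is authorized to view."""
    lines = []
    for name in sorted(all_names):
        marker = "*" if name in accessible else " "  # "*" stands in for boldface
        lines.append(f"{marker} {name}")
    return lines


# Made-up names: the user may view only "Actin".
index_lines = render_index(["Trypsin", "Actin"], {"Actin"})
```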

Next, the user selects a protein name for viewing, either from an index list or from

the results of a query search. At this step, once a protein name has been selected, the user

will be shown a display like that of Figure 11. Selecting a protein name causes the user's

network terminal to initiate a request to an appropriate web application server to create

the display page corresponding to Figure 11. In accordance with the web application

server programming, for example of the kind available from Oracle Development

Corporation, the proper display page is automatically created. This operation is

represented by the Figure 32 flow diagram box numbered 3208. The user may select

from multiple database choices. As illustrated in Figure 11, these choices may include

the "SwissProt" database, "GenBank" database, or PIR database, all of which contain

only text data. Alternatively, the choice may be for the "SBdBase", comprising the

multi-format protein relational database constructed in accordance with the present

invention. The selection of the "SBdBase" is indicated by the flow diagram box

numbered 3210.

With selection of the "SBdBase" button, the user's network terminal sends a

request to the security server, which recognizes the "SBdBase" selection and initiates

execution of appropriate "Java" programming language scripts. Such scripts are executed

quickly and without database server involvement, and so provide convenient and more

efficient operation of the system. This operation is represented by the flow diagram box

numbered 3212. The Java scripts cause the relevant database pathnames to be provided from the security server to Java language programming routines at the database server

controller. The pathnames may include, for example, the pathnames corresponding to

the multi-format data files entered into the database, such as illustrated in Figure 31. The

pathname sending operation is represented by the Figure 32 flow diagram box numbered

3214.

When the database controller receives the database pathnames, the controller

causes the associated database files to be sent from the database server through the

security server and on to the client network terminal, thereby causing the database files

to be downloaded to the network terminal. This operation is represented by the flow

diagram box numbered 3216. After the files have been downloaded to the network

terminal, the user at the terminal can invoke the "Visualization Toolkit" and view all

associated displays (comprising the views illustrated in Figures 14 through 30). Such

viewing will be accomplished through the user's web browser program, as described

above, and comprise entirely local operations. Thus, network traffic will not be involved.

These local operations are represented by the Figure 32 flow diagram box numbered

3218. Thus, viewing operations can be continued locally, as indicated by the "Continue"

box 3220 of Figure 32.

Figure 33 is a more detailed flow diagram representation of the operations

performed during the security checking operation box 3202 of Figure 32. In Figure 33,

the first operating step is represented by the sending of a request for access from the

user's network terminal to the security server. This operation is represented by the flow

diagram box numbered 3302. In the preferred embodiment, the security server checks the Internet protocol (IP) address of the sending user against an authorization list and

grants conditional privileges accordingly. This operation is represented by the flow

diagram box numbered 3304.

Next, the security server checks the received user login information for

verification. This may comprise, for example, checking user name and user password

information received from the network terminal. This security operation is represented

by the flow diagram box numbered 3306. If all security checking is approved, then the

security server grants access and returns appropriate display screen information to the

user, as represented by the flow diagram box numbered 3308. The system operation then

continues as described in Figure 32.
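
The two-stage check of Figure 33 (IP address against an authorization list, then login verification) can be sketched as below. All names, the address range, and the credentials are illustrative assumptions. The sketch also simplifies the description in two ways: the IP check is treated as a pass/fail gate rather than a grant of conditional privileges, and password hashing with PBKDF2 is an added choice that the source does not mention.

```python
import hashlib
import hmac
import ipaddress

# Assumed authorization list; 192.0.2.0/24 is a documentation-only range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
SALT = b"example-salt"


def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


# Hypothetical user table: user name -> salted password hash.
USERS = {"alice": _hash("s3cret", SALT)}


def ip_allowed(address: str) -> bool:
    """Box 3304: check the sender's IP address against the authorization list."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)


def login_ok(user: str, password: str) -> bool:
    """Box 3306: verify the user name and password."""
    stored = USERS.get(user)
    if stored is None:
        return False
    return hmac.compare_digest(stored, _hash(password, SALT))


def security_check(address: str, user: str, password: str) -> bool:
    """Box 3308: grant access only if both checks pass."""
    return ip_allowed(address) and login_ok(user, password)
```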

As noted above, a relational database management system (RDBMS) provides

access, security control, and search and retrieval services for the database files stored at

the database server. Figure 34 is a representation of the database schema used by the

relational database management system illustrated in Figure 1. The database schema

defines the tables that will be used by the RDBMS in performing the services described

above. Each protein in the database will have data in the RDBMS tables corresponding

to the fields described in Figure 34. Thus, the tables in Figure 34 determine the fields

over which the RDBMS can perform the search and retrieval services and determine the

level of control exercised over the other services provided by the RDBMS. In the

preferred embodiment, the tables are implemented by the RDBMS produced by Oracle

Development Corporation. Those skilled in the art will understand that RDBMS

products from other vendors are equally suitable. The primary table in Figure 34, called MY_CORE, includes fields from which

the menu display of Figure 5 is created. The MY_CORE fields include a field for a

specific protein identification number, listed as SP_ID, unique to the database

implementation, for internal identification. Other fields are for accession number

(ACC_NUM), pdb-file indicator (PDB_FLAG), database flag for the additional views

provided by the invention (SBDBASE_FLAG), special code (SPEC_CODE), a

description field (DESCRIPTION), disease notes (DISEASE_NOTES), function notes

(FUNCTION_NOTES), sequence data (MY_SEQUENCE), and keywords for searching

(KEYWORDS). The MY_CORE table also includes fields for gene data (GENES),

enzyme data (ENZYME), pdb data (PDB_DATA), and notes to indicate similarities to

other proteins (SIMILARITY_NOTES).
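By way of illustration only, the MY_CORE table described above might be sketched as follows. The column names are taken from the field names listed in the text; the column types, the use of SQLite in place of the Oracle RDBMS of the preferred embodiment, the reading of the two flag fields as 0/1 indicators, and the helper names build_core_table and menu_rows are assumptions made for this sketch and are not taken from Figure 34.

```python
import sqlite3

# Column names follow the MY_CORE fields named in the text. The SQL types,
# the use of SQLite, and the sample data are assumptions for illustration.
MY_CORE_DDL = """
CREATE TABLE MY_CORE (
    SP_ID            INTEGER PRIMARY KEY,  -- internal protein identification number
    ACC_NUM          TEXT,                 -- accession number
    PDB_FLAG         INTEGER,              -- pdb-file indicator (assumed 0/1)
    SBDBASE_FLAG     INTEGER,              -- flag for the additional views (assumed 0/1)
    SPEC_CODE        TEXT,                 -- special code
    DESCRIPTION      TEXT,
    DISEASE_NOTES    TEXT,
    FUNCTION_NOTES   TEXT,
    MY_SEQUENCE      TEXT,                 -- sequence data
    KEYWORDS         TEXT,                 -- keywords for searching
    GENES            TEXT,
    ENZYME           TEXT,
    PDB_DATA         TEXT,
    SIMILARITY_NOTES TEXT
)
"""


def build_core_table():
    """Create an in-memory database holding an empty MY_CORE table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(MY_CORE_DDL)
    return conn


def menu_rows(conn, keyword):
    """Return (SP_ID, DESCRIPTION) pairs whose KEYWORDS field contains keyword."""
    cursor = conn.execute(
        "SELECT SP_ID, DESCRIPTION FROM MY_CORE "
        "WHERE KEYWORDS LIKE ? ORDER BY SP_ID",
        ("%" + keyword + "%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_core_table()
    conn.execute(
        "INSERT INTO MY_CORE (SP_ID, ACC_NUM, PDB_FLAG, DESCRIPTION, KEYWORDS) "
        "VALUES (?, ?, ?, ?, ?)",
        (1, "ACC0001", 1, "example kinase entry", "kinase signal"),
    )
    print(menu_rows(conn, "kinase"))
```

In this sketch a keyword search returns the identification number and description of each matching protein, which is the kind of information from which a menu display could be built.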

Another table for the RDBMS is called FAMILY, which contains a single field

for the family name of the protein to which the Figure 34 tables correspond. Another

table in Figure 34 is called SUBFAMILY, which contains the family name and also any

subfamily name for the protein. A PROTEINS table contains fields for protein family

name, subfamily name, protein name, and protein identification number (SP_ID). Thus,

a system user can use the RDBMS to search the database by protein name,

protein family, protein subfamily, and protein identification number.
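A minimal sketch of the FAMILY, SUBFAMILY, and PROTEINS tables, and of a search over them, is given below. Only SP_ID is named in the text; the column names FAMILY_NAME, SUBFAMILY_NAME, and PROTEIN_NAME, the SQLite engine, and the function find_proteins are hypothetical and are introduced here only for illustration.

```python
import sqlite3


def build_family_tables():
    """Create in-memory FAMILY, SUBFAMILY, and PROTEINS tables (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE FAMILY (FAMILY_NAME TEXT PRIMARY KEY);
        CREATE TABLE SUBFAMILY (FAMILY_NAME TEXT, SUBFAMILY_NAME TEXT);
        CREATE TABLE PROTEINS (
            FAMILY_NAME    TEXT,
            SUBFAMILY_NAME TEXT,
            PROTEIN_NAME   TEXT,
            SP_ID          INTEGER
        );
        """
    )
    return conn


def find_proteins(conn, family=None, subfamily=None, name=None, sp_id=None):
    """Search PROTEINS by any combination of the four searchable fields."""
    criteria = (
        ("FAMILY_NAME", family),
        ("SUBFAMILY_NAME", subfamily),
        ("PROTEIN_NAME", name),
        ("SP_ID", sp_id),
    )
    clauses, params = [], []
    for column, value in criteria:
        if value is not None:
            # Column names come from the fixed tuple above; values are bound.
            clauses.append(column + " = ?")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = (
        "SELECT SP_ID, PROTEIN_NAME FROM PROTEINS WHERE "
        + where
        + " ORDER BY SP_ID"
    )
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = build_family_tables()
    conn.execute("INSERT INTO PROTEINS VALUES ('kinase', 'tyrosine', 'protA', 1)")
    print(find_proteins(conn, family="kinase"))
```

Any criterion left unspecified is simply omitted from the query, so the same function serves searches by name, family, subfamily, identification number, or a combination of these.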

The RDBMS tables also include a USER_ACCESS table, with which the

RDBMS controls access to the database depending on the individual user. That is, for

each protein entry in the database, the USER_ACCESS table indicates whether a

particular user has been granted access. As noted above, a user can be granted viewing access to the database, protein by protein. Thus, the USER_ACCESS table has fields for

user name (USER_NAME), protein identification (SP_ID), and a conventional protein

identification number.

The CUSTOMERS table contains information that is used by the RDBMS to

control database access according to customer accounts. The fields for CUSTOMERS

include USER_NAME, START_DATE, and END_DATE. Thus, customer access is

controlled by time period, to reflect whether a customer account is current or past due for

payments.
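The combined effect of the USER_ACCESS and CUSTOMERS tables can be sketched as a single check, shown below under stated assumptions. The field names USER_NAME, SP_ID, START_DATE, and END_DATE come from the text; storing dates as ISO-format text, omitting the third USER_ACCESS field (whose name is not given), the SQLite engine, and the function may_view are assumptions of this sketch.

```python
import sqlite3
from datetime import date


def build_access_tables():
    """Create in-memory USER_ACCESS and CUSTOMERS tables (dates as ISO text)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE USER_ACCESS (USER_NAME TEXT, SP_ID INTEGER);
        CREATE TABLE CUSTOMERS (
            USER_NAME  TEXT PRIMARY KEY,
            START_DATE TEXT,
            END_DATE   TEXT
        );
        """
    )
    return conn


def may_view(conn, user_name, sp_id, today):
    """Return True if the account is current and the user may view the protein."""
    day = today.isoformat()  # ISO dates compare correctly as strings
    row = conn.execute(
        """
        SELECT 1
        FROM CUSTOMERS c
        JOIN USER_ACCESS a ON a.USER_NAME = c.USER_NAME
        WHERE c.USER_NAME = ? AND a.SP_ID = ?
          AND c.START_DATE <= ? AND ? <= c.END_DATE
        """,
        (user_name, sp_id, day, day),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = build_access_tables()
    conn.execute("INSERT INTO CUSTOMERS VALUES ('alice', '2000-01-01', '2000-12-31')")
    conn.execute("INSERT INTO USER_ACCESS VALUES ('alice', 7)")
    print(may_view(conn, "alice", 7, date(2000, 6, 1)))
```

Access is granted in this sketch only when both conditions hold: the request date falls within the customer's account period, and the user has an entry for the requested protein.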

If the information in the Figure 34 tables is used for search and retrieval, then

the RDBMS uses the information in the fields to perform searching. Once proteins or

other database entries are identified as satisfying the search request, the RDBMS retrieves

the data files from the database server as described above and the data files are

downloaded to the user.

The browser display technique described above is advantageous in that a wide

variety of data formats can be accessed and displayed from within the same viewing

program, independently of the operating system being used, as long as the browser has

been configured with the appropriate helper applications. Those skilled in the art will

appreciate that conventional browser programs are developed with object-oriented

programming techniques, and that such browsers can be made to display the protein

database information in an appropriate manner through proper interface with the browser

programs. In particular, the Visualization Toolkit display shown in Figure is an

application that executes from within the browser and provides the special menus and window views shown above in the drawing figures. Figure 35 shows the classes that

interface with the browser, in a manner specified by the browser vendor.

Figure 35 indicates object classes for the browser product from Netscape

Communications Corporation, but other suitable browsers may be used and will occur

to those skilled in the art. The classes include an Ellipsoid class that generates the

ellipsoid view from the View menu. The class HydrophobicityPlot generates the

hydrophobicity plot view. The class "netscape.application.InternalWindow" generates

a window within the browser, according to the Netscape specification for the class

netscape.application.Window. The subclasses for the internal window include windows

for the Baluchandran plot (class sbiBaluCanvas), the hydrophobicity internal window

(class sbiHydrophobicity), the List window (class List), the Profile window (class

sbiProfile), the Protein Viewer window (class sbiProtViewer), the Ramachandran

window (class sbiRama), the Sequence Viewer (class sbiSeqViewer), and the Strain

window (class sbiStrain).

Other classes are for display windows or to perform calculations on-the-fly to

generate data that is displayed in windows. Such classes include generating the Profile

Plot (class ProfilePlot to generate data for the sbiProfile class), generating the Strain Plot

window (class StrainPlot to generate strain data), a graph test function (class graphTest),

generating the hydrophobicity graph (class hydrophobicityGraph to generate data), and

generating the Profile data (class profileGraph). Other classes provide the Alignment

window (class sbiAlignViewer), generating the Baluchandran data (class sbiBalu),

producing the Baluchandran window buttons (class sbiBaluButtons), providing the Ellipsoid window (class sbiEllipsoid), and providing the browser frame (class sbiGui).

Another class provides the pdb "Canvas" viewer window (class sbiPdbCanvas). Those

skilled in the art will recognize that "Canvas" is a particular viewing program for data.

Another class provides the pdb-data viewer (class sbiPdbViewer), and another class

provides the window frame slider control (class sbiSlider). Finally, other classes provide

the Active Sites display (class sbiActiveSites) and provide a converter application (class

sbiConverter).

6. Alternative Embodiment with a Distributed Network Architecture

Figure 36 shows the configuration of an alternative network system 3600

constructed in accordance with the present invention, using a distributed network

architecture and "Enterprise Java Beans" (EJB) components in a multi-tiered

configuration. The distributed architecture may be characterized as comprising multiple

programming tiers. This characterization of an EJB implementation will be familiar to

those skilled in the art. See, for example, Client/Server Programming with JAVA and

CORBA (2nd edition), by Robert Orfali and Dan Harkey, pp. 33-50.

The distributed system 3600 permits multiple client machines 3602, designated

Web Client 1, Web Client 2, ..., Web Client n, in a first tier to communicate over a shared

network 3604, such as the Internet, with a second tier comprising an

Authorization/Security access server 3606 that controls access by the clients 3602 to a

bioinformatics database. The access server 3606 can comprise one or more programs and

machines that perform the duties of the security server 106 and database server 102 of

the embodiment illustrated in Figure 1.

The client machines 3602 execute one or more user interface applets to interface

with the Authorization server 3606, which communicates with multiple "Enterprise Java

Beans" (EJB) components 3608 that provide the functionality needed to generate the

display panels and features illustrated in Figures 37 through 45 for the second

embodiment, as described further below. The Authorization server and EJB components

form the second tier of networking and communicate with a third tier of the system, a

Bioinformatics Database Management System (DBMS) server 3610. The Bioinformatics

DBMS server manages the collection of protein data stored in a database and provides

the protein data in response to requests and queries from the users 3602.
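The division of responsibility among the tiers can be sketched as follows. This is a simplified, single-process illustration rather than an EJB deployment; the class names BioinformaticsDBMS, ProteinComponent, and AccessServer, their methods, and the sample data are hypothetical and stand in for the DBMS server 3610, an EJB component 3608, and the access server 3606, respectively.

```python
class BioinformaticsDBMS:
    """Third tier: stores protein data (here, an in-memory dictionary)."""

    def __init__(self, proteins):
        self._proteins = dict(proteins)

    def fetch(self, name):
        """Return the stored record for a protein name, or None if absent."""
        return self._proteins.get(name)


class ProteinComponent:
    """Second-tier component: turns a client request into a DBMS query."""

    def __init__(self, dbms):
        self._dbms = dbms

    def describe(self, name):
        record = self._dbms.fetch(name)
        if record is None:
            raise KeyError(name)
        # Merge the requested name with the stored fields for display.
        return {"name": name, **record}


class AccessServer:
    """Second-tier entry point: checks authorization before delegating."""

    def __init__(self, component, allowed_users):
        self._component = component
        self._allowed = set(allowed_users)

    def handle(self, user, protein_name):
        if user not in self._allowed:
            raise PermissionError("user is not authorized: " + user)
        return self._component.describe(protein_name)


if __name__ == "__main__":
    dbms = BioinformaticsDBMS({"protA": {"family": "kinase"}})
    server = AccessServer(ProteinComponent(dbms), {"alice"})
    # A first-tier client would issue this call over the shared network.
    print(server.handle("alice", "protA"))
```

The client never addresses the data store directly in this arrangement: every request passes through the authorization check and then through a component that knows how to query the third tier.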

Those skilled in the art will understand how the EJB components 3608 can be

implemented using a distributed object model of the database 3610. For example, the

database can be structured to communicate with the clients according to the object

communication standard called Common Object Request Broker Architecture (CORBA),

which is specified by the industry consortium called Object Management Group (OMG).

The CORBA standard will be familiar to those skilled in the art of database design.

Figure 37 is a block diagram representation of the classes into which the EJB

components 3608 are organized. These components provide the functionality for the

GUI of the second embodiment, as described further below. The functionality of these

components generally duplicates the functionality of the corresponding classes described

above for the first embodiment of Figure 13 through Figure 35. Figure 37 shows that the

classes include a Java class for an Alignment database, to provide alignment views of proteins when requested by the client user, and also include a Java class for an Atomic

database, to provide atom-specific data.

The EJB classes 3608 also include classes for protein Chains, Domain, Family,

Protein, Residue, Secondary Structure, Subfamilies, Subfamily Proteins, Users, VAST,

and VRML. The VRML class provides support for the "Virtual Reality" display system

which, in the preferred embodiment, is achieved through the "Cosmo" VRML player

interface developed by Silicon Graphics, Inc. of Mountain View, California USA. A

Deployment EJB class handles communications tasks between user clients and the

application server for activation of classes. The VAST component performs processing

to check the database 3610 for protein similarity, such as for data mining functions. This

information is used in generating the displays for local strain energy, secondary structures,

dynamic surface, and hydrostatic surface.

In the preferred embodiment, the VAST component provides an interface to a

portion of the database 3610 (see Figure 36) that is constructed using the "Vector

Alignment Search Tool" (VAST) search program that is publicly available from the

National Institutes of Health (NIH) in Bethesda, Maryland, USA. In the preferred

embodiment illustrated in Figure 36, the database includes VAST output for the database

proteins. Thus, the VAST program is executed on the database proteins to create a

database of protein similarity information, in accordance with the VAST standard. In this

way, the VAST EJB component provides an interface to the VAST output data, such that

a similarity search request from a system user can be provided by appropriately scanning

the VAST database, rather than attempting to execute a comparison operation in real time. This improves the response time of the system. Those skilled in the art will be

familiar with the VAST search methodology, details of which are available from the

National Center for Biotechnology Information (NCBI) division of the National Library

of Medicine (NLM) at the NIH (available on-line through the World Wide Web at

www.ncbi.nlm.nih.gov).
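The benefit of scanning stored similarity output, rather than comparing structures at request time, can be illustrated with the sketch below. It does not reproduce the VAST program or its output format; the (protein, protein, score) triples, the score scale, and the functions build_similarity_index and find_similar are assumptions standing in for whatever similarity data is precomputed and stored.

```python
from collections import defaultdict


def build_similarity_index(pairs):
    """Index precomputed (protein_a, protein_b, score) triples by protein.

    The triples are assumed to have been produced offline, before any user
    request arrives, so no structure comparison is run at query time.
    """
    index = defaultdict(list)
    for protein_a, protein_b, score in pairs:
        index[protein_a].append((protein_b, score))
        index[protein_b].append((protein_a, score))
    for hits in index.values():
        # Highest score first; ties broken by name for a stable order.
        hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return dict(index)


def find_similar(index, protein, min_score=0.0):
    """Return names of similar proteins by scanning the stored results only."""
    return [name for name, score in index.get(protein, []) if score >= min_score]


if __name__ == "__main__":
    stored = [("A", "B", 0.9), ("A", "C", 0.4), ("B", "C", 0.7)]
    index = build_similarity_index(stored)
    print(find_similar(index, "A", min_score=0.5))
```

A similarity request in this sketch costs one dictionary lookup and a short scan, which is the reason a precomputed database can respond faster than a real-time comparison.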

Figure 38 is a representation of an Application screen 3800 shown to a user at a

client machine display of the network system illustrated in Figure 36. The Application

screen is shown in a display window at a client machine, such as in a Java applet

window. Those skilled in the art will understand the automatic launch of Java applets

from a web browser. In the preferred embodiment of the distributed architecture system,

the applet window includes a menu bar 3802 along the top of the applet window, a

Protein Selection display panel 3804 along the left side of the applet display, with a sub-

panel 3806 for showing proteins selected by the user and available for viewing, and a

Visualization panel 3808 at the right side of the applet display.

In Figure 38, the Protein Selection panel 3804 shows a Family Tree display that

lists the protein families that may be selected for investigation by the system user. As

described further below, other display tabs in the Protein Selection panel indicate that a

Search panel and a Data Mining panel may be called up, in addition to the Family Tree

panel shown. In this regard, it should be noted that this embodiment of the invention

permits especially rapid review and easy selection of protein families without need for

the multiple window displays associated with the first embodiment described above in

conjunction with Figures 5 through 9. A user can select protein families by positioning the computer display cursor on a desired protein family in the Family Tree display and

then using a display mouse button to "double-click" and select the protein family. The

protein family name will then be listed in the "Proteins Available for Viewing" panel in

the lower left area of the applet window.

In the Visualization panel 3808 on the right side of the applet window, the protein

may be displayed in one of several formats. In the preferred embodiment, these formats

include Description, Sequence, Structure, and Quality views, which are selected using

the computer display mouse and the tabs along the top edge of the Visualization panel.

The details of these display formats are the same as for the corresponding views

described above in conjunction with Figures 11 through 30. As noted below, however,

the alternative embodiment of Figure 36 uses a distributed network architecture for the

database and for the user-to-database interface, and thereby permits simultaneous display

of multiple proteins and associated data in the Visualization panel through EJB

components.

The Visualization panel of Figure 38 also shows that various sub-panels 3810

may be displayed and manipulated. For example, in the lower right area of the

Visualization panel, protein views may be selected to show Ramachandran plots,

Balasubram plots, Hydrophobicity data, Strain plots, and Profiles plots. In addition,

Sequence and Secondary Structure information may be shown simultaneously, as

indicated by the sub-panel adjacent to the Profile Analysis view at the bottom right of the

Visualization panel. The details of these display formats are the same as for the

corresponding views described above in conjunction with Figures 11 through 30. As noted below, however, the alternative embodiment of Figure 36 uses a distributed

network architecture for the database and for the user-to-database interface and thereby

permits simultaneous display of multiple proteins and associated data in the Visualization

panel.

Figure 39 is a representation of the Application screen 3900 shown to a user at

a client machine of the network system illustrated in Figure 36, illustrating the Search

panel that is shown in the application window when a user selects that display tab from

the Protein Selection display area on the left side of the display 3904. Figure 39 shows

that a user enters a protein pattern identifier in a protein dialogue box 3910 of the display

and then selects a Search display button 3912. The results of a search for the identified

protein are then listed in a text window 3914, showing the protein families that matched

the search string. Those skilled in the art will understand the protein pattern identifier

notation used to identify protein families. Beneath the pattern search panel, the web

client applet shows sequence information 3916 for a user-selected one of the search result

proteins. As noted above, one of the proteins may then be selected for viewing in the

Visualization area of the display screen, and multiple proteins may be selected for

simultaneous display.

Figure 40 is a representation of the Application screen 4000 shown to a user at

a client machine of the network system illustrated in Figure 36, showing the Data Mining

panel selected using the display tab of the Protein Selection panel 4004 at the left side of

the application window. Figure 40 shows that selection of the Data Mining tab generates

a dialogue box 4010 in which a protein name is entered, and on which a similarity search will be conducted. Figure 40 shows that a "Find All Similar Proteins" display button

4012 is provided. When the user selects the display button, the web client applet

forwards the protein search identifier entered by the user in the dialogue box 4010 and

provides it to the application server and then to the database management system. A

database search is then conducted and a list of similar proteins is provided in the text

window 4014 shown beneath the "Find" button.

From the list of Similar Proteins found from the database, a user can select one

or more of the "Similar Proteins" for further information. For example, Figure 40 shows

a panel 4016 below the "Find" box that is called "Similar Cores Between Selected

Proteins", in which the target protein entered in the dialogue box is shown, followed by

information for a selected one of the "Similar Proteins". In the "Proteins Available for

Viewing" panel 4018, both the target protein (shown as "muscle type acylphosphatase")

and the "Similar Protein" (shown as "orphan nuclear receptor HMR") are listed.

When these proteins are selected for viewing, as indicated by their entry in the "Proteins

Available for Viewing" panel, their corresponding views are shown in the Visualization

panel 4008.

Figure 41 is a representation of the Application screen 4100 shown to a user at

a client machine of the network system illustrated in Figure 36, illustrating that

multiple proteins can be shown simultaneously in the Visualization panel 4108. In particular, Figure

41 shows Quality display features for two selected proteins. The two proteins are

identified in the "Proteins Available for Viewing" panel 4108 at the lower left area of the

Application screen, and their corresponding Ramachandran plots of the Quality view are simultaneously shown in the visualization panel 4108, along with their respective Profile

Analysis panels with Residue information. As noted above, it should be understood that

all of the Description, Sequence, Structure, and Quality sub-panels and display views

described above for the first embodiment are shown in the respective display views in the

second embodiment. In the second embodiment of Figures 36 through 46, however,

multiple proteins can be shown simultaneously, advantageously using the web client

applets and the distributed architecture.

Figure 42 is a representation of the Application screen 4200 shown to a user at

a client machine of the network system illustrated in Figure 36, illustrating the

simultaneous display of protein Structure features for two selected proteins in the

Visualization panel 4208 of the application window. Moreover, Figure 42 shows that the

applet panels can be resized in accordance with known window programming techniques,

thereby making it possible to devote greater screen area to panels of greater interest.

Thus, in Figure 42, the visual display Structure panel of the Visualization area has been

resized to occupy a greater portion of the user's display screen as compared with the

visualization panel of Figures 38 through 41. Accordingly, the Protein Selection area

occupies less display screen area (compare, for example, with Figure 39). It should be

apparent to those skilled in the art that the resizing is achieved with the left and right

buttons of a display mouse, appropriately dragging the display cursor for the desired

resizing. Again, details of the information shown in the Structure display of the

Visualization panel are the same as those described above for the first embodiment. The

second embodiment, however, permits simultaneous display of multiple proteins.

Figure 43 is a representation 4300 of the Application screen of Figure 42, this

time showing the drop down menu 4320 for selection of Display features. More

particularly, Figure 43 shows that, from the Display item of the applet menu bar, a user may

select between a Skeleton display format, a Ball & Sticks format, a Spacefill format, and

a VDW Dot Skeleton display format for the Visualization panel. These display formats

will be known to those skilled in the art, so that such persons will understand what

information will be shown in such display views without further explanation. Again,

details of the information shown in the Display formats of the Visualization panel are the

same as those described above for corresponding Display views of the first embodiment.

Figure 44 is a representation 4400 of the Application screen of Figure 42,

showing the drop down menu 4420 for selection of Options for atoms to be viewed.

More particularly, Figure 44 shows that, from the Options item of the applet menu bar, a user

may select between a "C-Alpha Atoms" view, a "Main Chain Atoms" view, and an "All

Atoms" view for the Visualization panel. These view formats will be known to those

skilled in the art, so that such persons will understand what information will be shown

in such display views without further explanation. Again, details of the information

shown in the Options formats of the Visualization panel are the same as those described

above for corresponding views of the first embodiment.
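For illustration, the three Options views can be understood as filters over the atoms of a protein, as sketched below. The sketch assumes that each atom carries a PDB-style atom name, that the main chain consists of the backbone atoms named N, CA, C, and O, and that only protein atoms are present (so that "CA" denotes an alpha carbon and not a calcium ion); the dictionary representation and the function select_atoms are hypothetical.

```python
# Backbone atom names in PDB-style naming (assumed for this sketch).
MAIN_CHAIN_NAMES = {"N", "CA", "C", "O"}


def select_atoms(atoms, option):
    """Filter a list of atom records according to the selected Options view.

    Each atom is a dict with at least a "name" key holding its atom name.
    """
    if option == "C-Alpha Atoms":
        return [atom for atom in atoms if atom["name"] == "CA"]
    if option == "Main Chain Atoms":
        return [atom for atom in atoms if atom["name"] in MAIN_CHAIN_NAMES]
    if option == "All Atoms":
        return list(atoms)
    raise ValueError("unknown option: " + option)


if __name__ == "__main__":
    # One alanine residue: four backbone atoms plus the side-chain CB.
    residue = [{"name": n} for n in ("N", "CA", "C", "O", "CB")]
    for view in ("C-Alpha Atoms", "Main Chain Atoms", "All Atoms"):
        print(view, len(select_atoms(residue, view)))
```

For the single alanine residue in the usage example, the three views select one, four, and five atoms, respectively.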

Figure 45 is a representation of the Application screen 4500 shown to a user at

a client machine of the network system illustrated in Figure 36, illustrating simultaneous

Sequence display features for two selected proteins in the Visualization panel 4508.

Figure 45 shows the Visualization panel in the resized condition first shown in Figure 42, illustrating sequence information for multiple proteins. As before, the proteins are

selected through the "Proteins Available for Viewing" panel 4508 at the lower left of the

display.

Figure 46 is a graphical representation of the database objects for the database

design of the system illustrated in Figure 36. As with the first embodiment, the system

of the second embodiment shown in Figure 36 is implemented using object-oriented

programming techniques in which data objects are organized into classes, each of which

is characterized by attributes that specify parameters of the class and methods or

processes that specify behaviors of the class.

More particularly, Figure 46 shows that the database design of the second

embodiment includes a Residue class, an Atom class, a Domain class, and an Active Sites

class. The database also includes a Protein class, a Protein Sequence class, a Sequence

Link class, a Chain class, Secondary Structure class, VRMLURL class, Protein Segment

class, Protein Search class, Transformation class, an Alignment class, an Alignment

Residue class, and an Elements class. The database design also includes a Subfamily

class, a Subfamily Proteins class, and a Family class. Finally, the database includes a

Useraccess class, a UserAccessProteins class, a Feedback class, and a BugGroup class.

Attributes of the database classes are shown in the class boxes of Figure 46. These

classes store attribute data values and specify class behaviors to provide the functionality

described herein.
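A minimal sketch of how a few of these classes might relate to one another is given below. The class names Protein, Chain, Residue, and Atom come from the text; the attributes shown, the containment of atoms within residues, residues within chains, and chains within a protein, and the method atom_count are assumptions of this sketch and are not taken from the class boxes of Figure 46.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    name: str   # atom name, e.g. "CA" (assumed attribute)
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str    # residue name, e.g. "ALA" (assumed attribute)
    number: int  # position in the chain (assumed attribute)
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Protein:
    sp_id: int
    name: str
    chains: List[Chain] = field(default_factory=list)

    def atom_count(self):
        """Behavior of the class: count atoms across all chains and residues."""
        return sum(
            len(residue.atoms)
            for chain in self.chains
            for residue in chain.residues
        )


if __name__ == "__main__":
    alanine = Residue("ALA", 1, [Atom("N", 0.0, 0.0, 0.0), Atom("CA", 1.5, 0.0, 0.0)])
    protein = Protein(1, "example protein", [Chain("A", [alanine])])
    print(protein.atom_count())
```

Each class here holds attribute values and, in the case of Protein, a method, which mirrors the description of classes as combinations of attributes and behaviors.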

For example, the Residue class stores parameters to produce a protein residue

display with features such as illustrated in connection with the first embodiment (Figures 1 through 35) and the second embodiment (Figures 36 through 46). That is, the Residue

class contains the information typically needed to specify a residue display, which will

be apparent to those skilled in the art without further explanation. Similarly, the Atom

class contains information needed to specify display of a protein atom, the Domain class

contains information needed to specify a protein domain, and the Active Sites class

contains information needed to specify the active sites of a protein for display. Those

skilled in the art will understand the information needed to specify display of such

features, which are like those specified by the first embodiment described above for

corresponding displays in conjunction with Figures 1 through 35. Those skilled in the

art will likewise appreciate the information needed for the other classes shown in Figure

46, in conjunction with the description of classes for the first embodiment and for display

of the features described above.

Thus, the second embodiment illustrated in Figures 36 through 46 implements a

bioinformatics database access system having a graphical user interface (GUI) that

communicates over a shared network (such as the Internet) using graphical browsers to

establish a communications session. Once communications with the database server are

established, the GUI is provided through a platform-independent applet environment,

such as provided with a Java programming environment. Thus, no hypertext mark-up

language (HTML) page links are needed, and no common gateway interface (CGI) scripts

are needed to exchange information and retrieve database information and create

displays. Such processing, for example, also permits the simultaneous display of

database information for more than one protein. Such programming also permits a Protein Selection panel to be shown adjacent a Visualization panel, thereby permitting

search and selection of proteins, followed by display of visualization data for such

proteins, from the same display window. Thus, families of proteins can be shown along

the left side of the display, while visualization displays of the proteins can be shown

simultaneously along the right side of the display. This functionality is possible because,

with the Java implementation, the same applet that communicates with the bioinformatics

database also performs the visualization display tasks.

The present invention has been described above in terms of presently preferred

embodiments so that an understanding of the present invention can be conveyed. There

are, however, many configurations for protein database viewing systems not specifically

described herein but with which the present invention is applicable. The present

invention should therefore not be seen as limited to the particular embodiments described

herein, but rather, it should be understood that the present invention has wide

applicability with respect to molecular structure database viewing systems generally. All

modifications, variations, or equivalent arrangements and implementations that are within

the scope of the attached claims should therefore be considered within the scope of the

invention.

Claims

WE CLAIM:

1. A method of accessing molecular structure information over a computer

network and graphically viewing the information, the method comprising:

receiving authorization information from a user and checking for authorization

by that user to a database containing molecular structure information;

receiving information from a user identifying requested molecular structure

information;

providing the user access authorization;

downloading multiple files containing the requested molecular structure

information from a database server to the authorized user; and

viewing the downloaded files at the user with a graphical browser application

program.

2. A method as defined in claim 1, wherein the database comprises a

relational database containing molecular structure information for multiple protein

structures.

3. A method as defined in claim 1, wherein providing user access

authorization comprises checking user account information that permits user access to

less than the entire database.

4. A method as defined in claim 3, wherein providing user access

authorization comprises providing the user with a display of database portions to which

the user has been granted authorization.

5. A method as defined in claim 4, further comprising displaying the entire

available database and indicating those database portions to which the user has been

granted authorization.

6. A method as defined in claim 1, wherein viewing includes executing

helper applications to view different file formats from within the browser application

program.

7. A method as defined in claim 6, further comprising displaying a molecular

structure and displaying a sequence alignment view in an external frame that can be

separately manipulated by the user and that interacts with the molecular structure display.

8. A method as defined in claim 7, further comprising displaying secondary

structure elements with predetermined structural symbols.

9. A method as defined in claim 6, wherein the file formats include virtual

reality formats.

10. A method as defined in claim 1, wherein providing the user access

authorization occurs at a security server, and the step of downloading multiple files

occurs from a database file server.

11. A method as defined in claim 1, wherein receiving information from a

user and receiving multiple files comprises display of corresponding information in a

single application window of a computer display.

12. A method as defined in claim 1, wherein receiving multiple files

comprises display of visualization data for multiple proteins selected by the user.

13. A computer database system comprising:

a security server that receives user requests for user access authorization to a

database containing molecular structure information and receives

identification information from the user identifying requested molecular

structure information, and then determines if user access authorization

should be provided;

a database server that responds to a user access authorization by downloading

multiple files containing the requested molecular structure information to

the authorized user; and

a graphical browser application program that receives the downloaded files at the

user and displays the information contained therein.

14. A system as defined in claim 13, wherein the database comprises a

relational database containing molecular structure information for multiple protein

structures.

15. A system as defined in claim 13, wherein the security server provides user

access authorization by checking user account information that permits user access to less

than the entire database.

16. A system as defined in claim 15, wherein the security server provides user

access authorization by providing the user with a display of database portions to which

the user has been granted authorization.

17. A system as defined in claim 16, wherein the security server displays the

entire available database to the user prior to downloading and indicates those database

portions to which the user has been granted authorization.

18. A system as defined in claim 13, wherein the user graphical browser

application program executes helper applications to view different file formats from

within the browser application program.

19. A system as defined in claim 18, wherein the downloaded files received

by the user graphical browser application program permit it to display a molecular

structure and to display a sequence alignment view in an external frame that can be

separately manipulated by the user and that interacts with the molecular structure display.

20. A system as defined in claim 19, wherein the graphical browser

application program displays secondary structure elements with predetermined structural

symbols.

21. A system as defined in claim 18, wherein the file formats include virtual

reality formats.

22. A system as defined in claim 13, wherein the security server and database

server are separate computers connected over a network.

23. A method of operating a server for controlling access to molecular

structure information over a computer network, the method comprising:

receiving authorization information from a user and checking for authorization

by that user to a database containing molecular structure information;

receiving information from a user identifying requested molecular structure

information; and

granting the user authorization for access if it is determined that the user should

be provided with access authorization to permit downloading multiple

files containing the requested molecular structure information from a

database server to the authorized user, where the user can view the

downloaded files with a graphical browser application program.

24. A method as defined in claim 23, wherein the database comprises a

relational database containing molecular structure information for multiple protein

structures.

25. A method as defined in claim 23, wherein the step of granting user access

authorization comprises checking user account information that permits user access to

less than the entire database.

26. A method as defined in claim 25, wherein the step of granting user access

authorization comprises providing the user with a display of database portions to which

the user has been granted authorization.

27. A method as defined in claim 26, further comprising the step of permitting

display of the entire available database and indicating those database portions to which

the user has been granted authorization.

28. A method as defined in claim 23, wherein the granted authorization

permits downloading of different file formats that are viewed from within the browser

application program with helper applications.

29. A method as defined in claim 28, wherein the security server permits

downloading of files containing data that permit displaying a sequence alignment view

in an external frame of the browser application program such that it can be separately

manipulated by the user and interacts with the molecular structure display.

30. A method as defined in claim 29, further comprising the step of displaying

secondary structure elements with predetermined structural symbols.

31. A method as defined in claim 28, wherein the file formats include virtual

reality formats.

32. A method as defined in claim 23, wherein the step of providing the user

access authorization occurs at a security server, and the step of downloading multiple

files occurs from a database file server.

33. A method of operating a database server for providing molecular structure

information over a computer network, the method comprising:

receiving a user authorization following a security authorization check that grants

a user access to a database containing molecular structure information;

receiving information identifying requested molecular structure information; and

downloading multiple files containing the requested molecular structure

information to the authorized user, where the user can view the

downloaded files with a graphical browser application program.

34. A method as defined in claim 33, wherein the database comprises a

structures.

35. A method as defined in claim 33, wherein the granted authorization

permits downloading of different file formats that are viewed from within the browser

application program with helper applications.

36. A method as defined in claim 35, wherein the downloaded files include

data that permits displaying a sequence alignment view in an external frame of the

browser application program such that it can be separately manipulated by the user and

interacts with the molecular structure display.

37. A method as defined in claim 36, further comprising the step of displaying

secondary structure elements with predetermined structural symbols.

38. A method as defined in claim 35, wherein the file formats include virtual

reality formats.

39. A method of operating a user computer for accessing molecular structure

information over a computer network, the method comprising:

providing authorization information to a security server to request authorization

to a database containing molecular structure information and receiving an

authorization response;

providing information that identifies requested molecular structure information;

receiving multiple files in an authorized download operation, the downloaded

files containing the requested molecular structure information; and

viewing the downloaded files with a graphical browser application program.

40. A method as defined in claim 39, wherein the database comprises a

structures.

41. A method as defined in claim 39, wherein the authorization information

includes information sufficient for the security server to check user account information

that permits user access to less than the entire database.

42. A method as defined in claim 41, wherein the authorization response

includes information that provides a display of database portions to which authorization

has been granted.

43. A method as defined in claim 42, wherein the authorization response

further includes information that provides a display of the entire available database and

indicates those database portions to which authorization has been granted.

44. A method as defined in claim 39, wherein the granted authorization

application program with helper applications.

45. A method as defined in claim 44, wherein the downloaded files contain

browser application program such that it can be separately manipulated by the user and

interacts with the molecular structure display.

46. A method as defined in claim 45, further comprising the step of displaying

secondary structure elements with predetermined structural symbols.

47. A method as defined in claim 39, wherein the file formats include virtual

reality formats.

48. A method as defined in claim 39, wherein the downloaded multiple files

are received from a database file server.