US20140379443A1 - Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising - Google Patents


Info

Publication number
US20140379443A1
Authority
US
United States
Prior art keywords
rating
ordinomial
ratings
content
posterior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US14/184,264
Inventor
Joshua M Attenberg
Foster J Provost
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Integral Ad Science Inc
Original Assignee
Integral Ad Science Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/859,763 (published as US20110047006A1)
Application filed by Integral Ad Science Inc
Priority to US14/184,264
Publication of US20140379443A1
Assigned to SILICON VALLEY BANK: SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: INTEGRAL AD SCIENCE, INC.
Assigned to ADSAFE MEDIA, LTD.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PROVOST, FOSTER J.; ATTENBERG, JOSHUA M.
Assigned to INTEGRAL AD SCIENCE, INC.: CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: ADSAFE MEDIA, LTD.
Assigned to INTEGRAL AD SCIENCE, INC.: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT. Assignors: SILICON VALLEY BANK
Assigned to GOLDMAN SACHS BDC, INC., AS COLLATERAL AGENT: PATENT SECURITY AGREEMENT. Assignors: INTEGRAL AD SCIENCE, INC.
Assigned to INTEGRAL AD SCIENCE, INC.: RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL AT REEL/FRAME NO. 46594/0001. Assignors: GOLDMAN SACHS BDC, INC., AS COLLATERAL AGENT
Assigned to PNC BANK, NATIONAL ASSOCIATION, AS ADMINISTRATIVE AGENT: PATENT SECURITY AGREEMENT. Assignors: INTEGRAL AD SCIENCE, INC.


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0263 Targeted advertisements based upon Internet or website rating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F17/3053
    • G06F17/30861
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement

Definitions

  • the disclosed subject matter generally relates to methods, systems, and media for applying scores and ratings to web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising.
  • Online advertisers use tools that provide information about websites or publishers and the viewers of such websites to facilitate more effective planning and management of online advertising by advertisers.
  • online advertisers continually desire increased control over the web pages on which their advertisements and brand messages appear. For example, particular online advertisers want to control the risk that their advertisements and brand messages appear on pages or sites that contain objectionable content (e.g., pornography or adult content, hate speech, bombs, guns, ammunition, alcohol, offensive language, tobacco, spyware, malicious code, illegal drugs, music downloading, particular types of entertainment, illegality, obscenity, etc.).
  • advertisers for adult-oriented products, such as alcohol and tobacco, want to avoid pages directed towards children.
  • the disclosed subject matter provides advertisers, agencies, advertisement networks, advertisement exchanges, and publishers with the ability to make risk-controlled decisions based on the category-specific risk and/or general risk associated with a given web page, website, etc.
  • advertisers, agencies, advertisement networks, advertisement exchanges, and publishers can determine whether to place a particular advertisement on a particular web page based on a high confidence that the page does not contain objectionable content.
  • advertisers, agencies, advertisement networks, advertisement exchanges, and publishers can request to view a list of pages in their current advertisement network traffic assessed to have the highest risk of objectionable content.
  • the risk rating can, in some embodiments, represent the probability that a page or a site contains or will contain objectionable content, the degree of objectionability of the content, and/or any suitable combination thereof.
  • the risk rating can be determined for a single domain and/or a single category such that a particular piece of media or content can have a rating for each of a number of objectionable content categories.
  • the risk rating can be determined across several objectionable content categories, across multiple pieces of content (e.g., the pages appearing in the advertiser's traffic), and/or across multiple domains managed by a publisher.
  • these mechanisms can be generated using multiple statistical models and considering multiple pieces of evidence. In some embodiments, these mechanisms can account for temporal dynamics in content by determining a risk rating that is based on the probability of encountering different severity levels from a given URL and that is based on the types of estimated severity exhibited in the past.
  • these mechanisms can evaluate the quality of collections of content. More particularly, these mechanisms can collect individual content ratings (e.g., ordinal ratings and/or real-valued ratings), aggregate these ratings across arbitrary subsets, normalize these ordinal and real-valued ratings onto a general index scale, and calibrate and/or map the normalized ratings using a global mean to provide a benchmark for comparison. This mapping can capture the risk and/or severity profiles of appearance of content.
  • the method comprises: extracting one or more features from a piece of web content; applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determining a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generating a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and providing the risk rating for determining whether an advertisement should be associated with the web content.
  • the method further comprises: determining a plurality of posterior ordinomial estimates at a plurality of times for the web content; and determining an expected posterior ordinomial estimate by combining the plurality of posterior ordinomial estimates over the plurality of times.
  • the method further comprises: extracting a uniform resource locator from the one or more features; assembling a first set of posterior ordinomial estimates from the plurality of posterior ordinomial estimates based on the uniform resource locator; and determining the expected posterior ordinomial estimate by combining the first set of posterior ordinomial estimates over the plurality of times.
  • the method further comprises: determining that the web content belongs to a sitelet, wherein the sitelet includes a plurality of web pages; determining a sitelet ordinomial by aggregating the plurality of posterior ordinomial estimates associated with each of the plurality of web pages; and generating a sitelet rating based on the aggregated plurality of posterior ordinomials.
  • the method further comprises: comparing the sitelet ordinomial with the plurality of posterior ordinomial estimates associated with each of the plurality of web pages belonging to the sitelet; and determining whether to store at least one of the sitelet ordinomial and the plurality of posterior ordinomial estimates based on the comparison and a sensitivity value.
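The sitelet steps above can be sketched in code. Everything here is an illustrative assumption: the averaging aggregation, the L1-distance comparison, and the sample numbers are not specified by the disclosure, which only requires that page ordinomials be aggregated into a sitelet ordinomial and that a sensitivity value govern what is stored.

```python
# Hypothetical sketch: aggregate page posterior ordinomials into a
# sitelet ordinomial, then use a sensitivity value to decide whether
# the per-page estimates still need to be kept.

def sitelet_ordinomial(page_ordinomials):
    """Average the pages' ordinomials column-wise (one assumed
    aggregation; the patent does not fix the aggregation function)."""
    n = len(page_ordinomials)
    return [sum(col) / n for col in zip(*page_ordinomials)]

def keep_page_estimates(page_ordinomials, sensitivity=0.1):
    """Keep per-page ordinomials only if some page deviates from the
    sitelet ordinomial by more than the sensitivity threshold
    (deviation measured here as an L1 distance, an assumption)."""
    agg = sitelet_ordinomial(page_ordinomials)
    for page in page_ordinomials:
        if sum(abs(p - a) for p, a in zip(page, agg)) > sensitivity:
            return True
    return False

# Three pages over three severity bins; the third page is an outlier.
pages = [[0.9, 0.08, 0.02], [0.88, 0.1, 0.02], [0.2, 0.3, 0.5]]
agg = sitelet_ordinomial(pages)
```

Because the third page deviates strongly from the sitelet average, `keep_page_estimates(pages)` returns `True` here, i.e., the per-page estimates would be stored alongside the sitelet ordinomial.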
  • the method further comprises: collecting a plurality of ratings associated with a plurality of pieces of web content, wherein the plurality of ratings includes ordinal ratings and real-valued ratings; and determining an aggregate rating for the plurality of pieces of web content based on the collected plurality of ratings.
  • the method further comprises normalizing the aggregate rating by mapping the aggregate rating to an index-scaled rating.
  • the method further comprises: applying a severity weight to the index-scaled rating; and generating a severity-weighted index-scaled rating for the plurality of pieces of web content.
  • the method further comprises generating a combined risk rating by combining the generated risk rating that encodes whether the web content is likely to contain objectionable content of the given category with a second risk rating that encodes whether the web content is likely to contain objectionable content of a second category.
  • a system for rating webpages for safe advertising comprising a processor that: extracts one or more features from a piece of web content; applies a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determines a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generates a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and provides the risk rating for determining whether an advertisement should be associated with the web content.
  • a non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for rating webpages for safe advertising, the method comprising: extracting one or more features from a piece of web content; applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determining a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generating a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and providing the risk rating for determining whether an advertisement should be associated with the web content.
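The claimed pipeline (extract features, apply several statistical models to obtain ordinomial estimates, combine into a posterior) can be sketched minimally. The toy feature extractor, the two stub "models," their probability formulas, and the equal-weight combination below are all illustrative assumptions, not the patent's implementation; only the shape of the pipeline follows the claims.

```python
# Hypothetical end-to-end sketch of the claimed method over three
# ordered severity groups b1 (safe) .. b3 (most severe).

def extract_features(content: str) -> dict:
    """Toy feature extraction: word count and count of flagged terms."""
    words = content.lower().split()
    return {"n_words": len(words),
            "flagged": sum(w in {"gun", "ammo"} for w in words)}

def model_a(features: dict) -> list:
    """Stub model: flagged-term density drives severity mass."""
    risky = features["flagged"] / max(features["n_words"], 1)
    return [1 - risky, risky * 0.6, risky * 0.4]

def model_b(features: dict) -> list:
    """Second stub model with a different (assumed) risk formula."""
    risky = min(1.0, features["flagged"] * 0.2)
    return [1 - risky, risky * 0.5, risky * 0.5]

def posterior_ordinomial(features: dict) -> list:
    """Combine per-model ordinomial estimates by equal-weight
    averaging (one of the aggregation approaches the patent names)."""
    ests = [model_a(features), model_b(features)]
    return [sum(col) / len(ests) for col in zip(*ests)]

feats = extract_features("hunting rifle ammo review page with many words " * 5)
post = posterior_ordinomial(feats)
```

The posterior `post` is itself an ordinomial (its entries sum to one), from which a risk rating can then be derived as described in the figures below.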
  • FIG. 1 is a diagram of an illustrative example of a process for determining the probability of membership in a severity group for a category of objectionable content in accordance with some embodiments of the disclosed subject matter.
  • FIG. 2 is a diagram of an illustrative example of combining ordinomial estimates into a posterior ordinomial estimate in accordance with some embodiments of the disclosed subject matter.
  • FIG. 3 is an illustrative example of temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 4 is an illustrative example of the map reduction approach (MapReduce) for determining the temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 5 is a diagram of an illustrative example of a process for generating one or more ratings for a webpage in accordance with some embodiments of the disclosed subject matter.
  • FIG. 6 is a diagram of a graph showing the selection of an appropriate bin (b i ) in an ordinomial given a confidence parameter (τ) in accordance with some embodiments of the disclosed subject matter.
  • FIG. 7 is a diagram of an illustrative rating scale in accordance with some embodiments of the disclosed subject matter.
  • FIG. 8 is an illustrative example that incoming URLs can be matched to the sitelet with the longest available shared prefix in accordance with some embodiments of the disclosed subject matter.
  • FIG. 9 is an illustrative example of calculating sitelet ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 10 is an illustrative example of calculating sitelet ordinomials and sitelet ratings in settings with small domains in accordance with some embodiments of the disclosed subject matter.
  • FIG. 11 is an illustrative example of calculating sitelet ordinomials and sitelet ratings in settings with larger domains in accordance with some embodiments of the disclosed subject matter.
  • FIG. 13 is a diagram of an illustrative system on which a rating application can be implemented in accordance with some embodiments of the disclosed subject matter.
  • FIG. 14 is a diagram of an illustrative user computer and server as provided, for example, in FIG. 13 in accordance with some embodiments of the disclosed subject matter.
  • mechanisms for scoring and rating web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising are provided. These mechanisms, among other things, generate a risk rating that accounts for the inclusion of objectionable content with the use of ordinomials.
  • these mechanisms can be used in a variety of applications.
  • these mechanisms can provide a rating application that allows advertisers, ad networks, publishers, site managers, and/or other entities to make risk-controlled decisions based at least in part on risk associated with a given webpage, website, or any other suitable content (generally referred to herein as a “webpage” or “page”).
  • these mechanisms can provide a rating application that allows advertisers, agencies, advertisement networks, advertisement exchanges, and/or publishers to determine whether to place a particular advertisement on a particular web page based on a high confidence that the page does not contain objectionable content.
  • these mechanisms allow an advertiser to designate that an advertisement should not be placed on a web page unless a particular confidence (e.g., high confidence, medium-high confidence, etc.) is achieved.
  • the particular confidence may be determined based on having a severity greater than a particular severity group in a particular category.
  • these categories can include content that relates to guns, bombs, and/or ammunition (e.g., sites that describe or provide information on weapons including guns, rifles, bombs, and ammunition, sites that display and/or discuss how to obtain weapons, the manufacture of weapons, the trading of weapons (whether legal or illegal), sites which describe or offer for sale weapons including guns, ammunition, and/or firearm accessories, etc.).
  • these categories can include content relating to alcohol (e.g., sites that provide information relating to alcohol, sites that provide recipes for mixing drinks, sites that provide reviews and locations for bars, etc.), drugs (e.g., sites that provide instructions for or information about obtaining, manufacturing, or using illegal drugs), and/or tobacco (e.g., sites that provide information relating to smoking, cigarettes, chewing tobacco, pipes, etc.).
  • these categories can include offensive language (e.g., sites that contain swear words, profanity, harsh language, and/or inappropriate phrases and expressions), hate speech (e.g., sites that advocate hostility or aggression towards individuals or groups on the basis of race, religion, gender, nationality, or ethnic origin, sites that denigrate others or justify inequality, sites that purport to use scientific or other approaches to justify aggression, hostility, or denigration), and/or obscenities (e.g., sites that display graphic violence, the infliction of pain, gross violence, and/or other types of excessive violence).
  • these categories can include adult content (e.g., sites that contain nudity, sex, use of sexual language, sexual references, sexual images, and/or sexual themes).
  • these categories can include spyware or malicious code (e.g., sites that provide instructions to practice illegal or unauthorized acts of computer crime using technology or computer programming skills, sites that contain malicious code, etc.) or other illegal content (e.g., sites that provide instructions for threatening or violating the security of property or the privacy of others, such as theft-related sites, lock-picking and burglary-related sites, and fraud-related sites).
  • objectionable content on one or more of these webpages can generally be defined as having a severity level worse than (or greater than) b j in a category y.
  • Each category (y) can include various severity groups b j , where j ranges from 1 to n and n is an integer greater than one.
  • an adult content category can have various severity levels, such as G, PG, PG-13, R, NC-17, and X.
  • an adult content category and an offensive speech category can be combined to form one category of interest.
  • a category may not have fine grained severity groups and a binomial distribution can be used.
  • a binomial probability can be used for binary outcome events, where there is typically one positive event (e.g., good, yes, etc.) and one negative event (e.g., bad, no, etc.).
  • FIG. 1 is a diagram showing an example of a process for determining the probability of membership in a severity group for one or more categories of objectionable content in accordance with some embodiments of the disclosed subject matter.
  • process 100 begins by receiving or reviewing content on a webpage, website, or any other suitable content (generally referred to herein as a “webpage” or “page”) at 110 .
  • a rating application can receive multiple requests to rate a group of webpages or websites.
  • a rating application can receive, from an advertiser, a list of websites on which the advertiser is interested in placing an advertisement, provided that each of these websites does not contain or does not have a high likelihood of containing objectionable content.
  • a rating application can receive, from an advertiser, that advertiser's current advertisement network traffic for assessment.
  • the rating application or a component of the rating application selects a uniform resource locator (URL) for rating at 120 .
  • the rating application can receive one or more requests from other components (e.g., the most popular requests are assigned a higher priority, particular components of the rating application are assigned a higher priority, or random selection from the requests).
  • a fixed, prioritized list of URLs can be defined based, for example, on ad traffic or any other suitable input (e.g., use of the rating for scoring, use of the rating for active learning, etc.).
  • One or more pieces of evidence can be extracted from the uniform resource locator or page at 130 .
  • These pieces of evidence can include, for example, the text of the URL, image analysis, HyperText Markup Language (HTML) source code, site or domain registration information, ratings, categories, and/or labeling from partner or third party analysis systems (e.g., site content categories), source information of the images on the page, page text or any other suitable semantic analysis of the page content, metadata associated with the page, anchor text on other pages that point to the page of interest, ad network links and advertiser information taken from a page, hyperlink information, malicious code and spyware databases, site traffic volume data, micro-outsourced data, any suitable auxiliary derived information (e.g., ad-to-content ratio), and/or any other suitable combination thereof.
  • evidence and/or any other suitable information relating to the page can be collected, extracted, and/or derived using one or more evidentiary sources.
  • an ordinomial can be generated at 140 .
  • a multi-severity classification can be determined by using an ordinomial to encode the probability of membership in an ordered set of one or more severity groups.
  • the ordinomial can be represented as follows:
  • O x =⟨p(y=b 1 |x), p(y=b 2 |x), . . . , p(y=b n |x)⟩, where y is a variable representing the severity class that page x belongs to. It should be noted that the ordinal nature implies that b i is less severe than b j when i<j. It should also be noted that ordinomial probabilities can be estimated using any suitable statistical models, such as the ones described herein, and using the evidence derived from the pages.
  • an ordinomial distribution that includes each generated ordinomial for one or more severity groups can be generated. Accordingly, the cumulative ordinal distribution F can be described as: F(b i |x)=Σ j≤i p(y=b j |x).
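The ordinomial and its cumulative ordinal distribution F can be put into code directly. The bin labels and probabilities below are illustrative assumptions; the structure (ordered bins, probabilities summing to one, running cumulative sum) follows the passage above.

```python
# An ordinomial p(y=b_j | x) over ordered severity bins (labels assumed),
# and its cumulative ordinal distribution F(b_i | x).

ordinomial = {"G": 0.7, "PG": 0.2, "R": 0.08, "X": 0.02}

def cumulative_F(ordinomial):
    """F(b_i | x) = sum of p(y=b_j | x) for all j <= i, walking the
    bins from least to most severe (dicts preserve insertion order)."""
    total, F = 0.0, {}
    for b, p in ordinomial.items():
        total += p
        F[b] = total
    return F

F = cumulative_F(ordinomial)
```

For a valid ordinomial, F is non-decreasing and reaches 1.0 at the most severe bin.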
  • a binary or binomial-probability determination of appropriateness or objectionability can be projected onto an ordinomial by considering the extreme classes—b 1 and b n .
  • a binomial determination can be performed, where the extreme classes include one positive class (e.g., malware is present in the content) and one negative class (e.g., malware is not present in the content).
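The projection just described, placing all of a binomial's probability mass on the extreme classes b 1 and b n , is simple enough to show directly; the bin count and probability below are illustrative.

```python
# Project a binary (binomial) determination onto an n-bin ordinomial by
# assigning all mass to the extreme classes, as described above.

def binomial_to_ordinomial(p_positive, n_bins):
    """All probability goes to the least-severe bin b1 (negative class,
    e.g., malware absent) and the most-severe bin bn (positive class,
    e.g., malware present); interior bins receive zero."""
    ordinomial = [0.0] * n_bins
    ordinomial[0] = 1.0 - p_positive
    ordinomial[-1] = p_positive
    return ordinomial

o = binomial_to_ordinomial(0.05, 5)
```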
  • Ordinomial probabilities can be estimated using one or more statistical models, for example, from evidence derived or extracted from the received web pages.
  • In process 100 of FIG. 1 and the other processes described herein, some steps can be added, some steps may be omitted, the order of the steps may be rearranged, and/or some steps may be performed simultaneously.
  • ordinomials can be generated from a variety of different statistical models based on a diverse range of evidence. For example, different pieces of evidence can be accounted for in the determination of an ordinomial. These ordinomial estimates can be combined into a posterior ordinomial estimate using, for example, ensemble approaches and information fusion approaches.
  • example aggregation approaches include weighted averaging, AdaBoost-type mixing, or using sub-ordinomials as covariates in a secondary model. Accordingly, as shown in FIG. 2 , this can be represented as: p(y=b i |x)=f(p 1 (y=b i |x), . . . , p m (y=b i |x)), where p k denotes the ordinomial estimate from the k-th model and f is the chosen combination function.
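Weighted averaging, the first of the aggregation approaches named above, can be sketched as follows; the sub-ordinomials and weights are illustrative values.

```python
# Weighted averaging of sub-ordinomials into a posterior ordinomial:
# posterior_i = sum_k w_k * p_k(y=b_i|x) / sum_k w_k

def weighted_average(sub_ordinomials, weights):
    """Combine per-model estimates bin by bin, normalizing by the
    total weight so the result remains a valid ordinomial."""
    total = sum(weights)
    n_bins = len(sub_ordinomials[0])
    return [
        sum(w * est[i] for w, est in zip(weights, sub_ordinomials)) / total
        for i in range(n_bins)
    ]

# Two models over three severity bins; the first model is trusted 3x.
subs = [[0.8, 0.15, 0.05], [0.6, 0.3, 0.1]]
posterior = weighted_average(subs, weights=[3.0, 1.0])
```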
  • the rating application can provide temporal aggregation features to account for the change to web pages over time.
  • FIG. 4 shows an illustrative example of the map reduction approach (MapReduce) for determining the temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter.
  • URLs can be used as the key for the reduction phase of the MapReduce process. This has the effect of compiling all samples that belong to a given domain onto a single computer during the reduction.
  • the ordinomial probabilities and the timestamp denoting the instant each ordinomial probability sample was taken are passed. More particularly, as shown in FIG. 4 , the posterior ordinomials for a given domain can be sorted based on the timestamp or observation time.
  • Probability estimates can then be performed, where the sorted posterior ordinomials for a given domain are combined and an expected posterior ordinomial is calculated. Depending on the computational nature of the temporal aggregation, this expected ordinomial can be stored for use in future temporal aggregations, thereby alleviating the need for explicit storage of each individual record. Additionally, the reduction phase of this MapReduce process can compute and output a rating as described herein.
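The reduce phase just described can be sketched as a small in-memory analogue: samples keyed by URL are grouped, sorted by timestamp, and folded into an expected posterior ordinomial. The exponential-decay update rule and the decay constant are assumptions; the disclosure only requires some temporal combination of the sorted samples.

```python
# Hedged sketch of the MapReduce reduction: group (url, timestamp,
# ordinomial) samples by URL, sort by time, and fold into an expected
# posterior where newer observations carry more weight (assumed rule).

from collections import defaultdict

def reduce_by_url(samples, decay=0.5):
    """samples: iterable of (url, timestamp, ordinomial).
    Returns {url: expected posterior ordinomial}."""
    by_url = defaultdict(list)
    for url, ts, o in samples:
        by_url[url].append((ts, o))
    expected = {}
    for url, obs in by_url.items():
        obs.sort()  # oldest first, as in the sorted reduction
        acc = list(obs[0][1])
        for _, o in obs[1:]:
            # running update: acc <- (1-decay)*acc + decay*new_sample,
            # so only the running expectation needs to be stored
            acc = [(1 - decay) * a + decay * p for a, p in zip(acc, o)]
        expected[url] = acc
    return expected

samples = [
    ("example.com/a", 2, [0.5, 0.5]),
    ("example.com/a", 1, [1.0, 0.0]),
]
exp = reduce_by_url(samples)
```

Because the update only reads the running accumulator, each reducer can persist one ordinomial per URL rather than every historical sample, matching the storage-saving point made above.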
  • FIG. 5 is a diagram of an example of a process 500 for generating a rating (R) for a webpage in accordance with some embodiments of the disclosed subject matter.
  • a posterior ordinomial estimate p(y=b i |x) that includes severity and confidence parameters is determined.
  • an advertiser may desire that the rating represents a particular confidence that the page's content is no worse than severity group b j .
  • an advertiser may desire that the rating encodes the confidence that a particular webpage is no better than a particular severity group.
  • process 500 begins by selecting the worst severity in accordance with a user-specified confidence parameter (τ) at 510 .
  • As shown in FIG. 6 , starting from the least severe or objectionable category in the ordinomial (b 1 ), the bins of the ordinomial are ascended while maintaining a running sum of the probabilities encountered.
  • the bin b i where the level of confidence (τ) is reached can be represented by: b i =min{b i : Σ j≤i p(y=b j |x)≥τ}.
  • the bin b i is selected such that the application has at least the level of confidence (τ) that the content is no worse than b i .
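The bin-selection rule just described, ascending the ordinomial from the least severe bin and accumulating probability until the confidence level (called tau here) is reached, is straightforward in code; the example ordinomial is illustrative.

```python
# Select the least-severe bin b_i whose cumulative probability reaches
# the confidence level tau, as in the passage above.

def select_bin(ordinomial, tau):
    """Return the index i of the least-severe bin such that the
    cumulative probability F(b_i|x) >= tau; falls back to the most
    severe bin if tau is never reached (only possible via rounding)."""
    cumulative = 0.0
    for i, p in enumerate(ordinomial):
        cumulative += p
        if cumulative >= tau:
            return i
    return len(ordinomial) - 1

# At 90% confidence, the content is judged no worse than bin 1 of 0..3:
i = select_bin([0.7, 0.25, 0.04, 0.01], tau=0.9)
```

Raising tau can only move the selected bin toward the severe end, i.e., a more cautious (higher-confidence) answer.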
  • the rating application can determine ratings from a given page's ordinomial probability estimates that encode both severity and confidence. It should be noted that the rating application can assume that ratings are given on a numeric scale that can be divided into ranges B j , where there is a one-to-one mapping between these ranges and the b j . That is, step 510 of process 500 indicates that there is a particular confidence that a page has severity no worse than b j , and the rating (R) is somewhere in the range B j . For example, as shown in FIG. 7 , the rating scale 700 can be a numeric scale of the numbers 0 through 1000, where 1000 denotes the least severe end or the highly safe portion of the scale.
  • rating scale 700 can be further divided such that particular portions of the rating scale are determined to be the best pages—e.g., ratings falling between 800 and 1000. Accordingly, if there is greater than τ confidence that the page's content is no worse than the best category, then the page's rating falls in the 800-1000 range.
  • interior rating ranges for a particular objectionability category can be defined.
  • the rating application can generate one or more ratings that take into account the difference between being uncertain between R rated content and PG rated content, where R and PG are two interior severity levels within the adult content category.
  • the rating application can generate one or more ratings that take into account the difference between a page having no evidence of X rated content and a page having some small evidence of containing X rated content.
  • rating range B j can be defined by its boundaries, s j-1 and s j .
  • one or more ratings can be generated for one or more objectionable categories. For example, multiple ratings can be generated, where one rating is generated for each selected objectionable content category (e.g., adult content, offensive language, and alcohol).
  • ratings for two or more objectionable categories can be combined to create a combined score. For example, a first rating generated for an adult content category and a second rating generated for an offensive language category can be combined.
  • weights can be assigned to each category such that a higher weight can be assigned to the adult content category and a lower weight can be assigned to the offensive language category. Accordingly, an advertiser or any other suitable user of the rating application can customize the score by assigning weights to one or more categories.
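The category weighting described above can be sketched as a weighted average; the category names and weight values below are illustrative, not taken from the patent.

```python
def combined_score(category_ratings, weights):
    """Combine per-category ratings into one score, weighting the
    categories an advertiser cares most about more heavily."""
    total_weight = sum(weights[c] for c in category_ratings)
    return sum(weights[c] * r for c, r in category_ratings.items()) / total_weight
```

For example, with adult content weighted three times as heavily as offensive language, a page rated 600 for adult content and 900 for offensive language receives a combined score of 675.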
  • a multi-dimensional rating vector can be created that represents, for each site, the distribution of risk of adjacency to objectionable content along different dimensions: guns, bombs, and ammunition; alcohol; offensive language; hate speech; tobacco; spyware and malicious code; illegal drugs; adult content; gaming and gambling; entertainment; illegality; and/or obscenity.
  • the rating application can determine a rating for a sitelet.
  • a sitelet is a collection or subset of web pages and, more particularly, is often a topically homogeneous portion of a site, such as a topic-oriented subtree of a large site's hierarchical tree structure. For example, “finance.yahoo.com” can receive a rating as a sitelet of the website “yahoo.com.”
  • the rating application can rate sitelets because there are web pages that the rating application has never seen before. That does not mean, however, that the rating application has no evidence with which to rate such a page: there is substantial rating locality within sitelets, and a page from a risky site or sitelet is likely to be risky itself.
  • the rating application can rate sitelets for computational storage efficiency as it may not be necessary to save or store the scores for individual pages if they are not significantly different from the scores for the sitelet. For example, if the ratings for the individual pages that make up website www.foo.com are within a given threshold value (e.g., a 5% difference), the rating application can store a rating for a sitelet (a collection of those individual pages). It should also be noted that sitelet scores can provide additional evidence to the rating computation even when the page has been seen before.
  • advertising on a website can be an indication of direct financial support of the website. Even if a particular page does not contain objectionable content or is determined to not likely contain objectionable content, an advertiser may not want to support a site that otherwise promotes objectionable categories of content.
  • the rating application can provide an indication when a particular news item promotes or supports a major Vietnamese website.
  • the rating application can provide an indication when a particular advertiser that supports or advertises on a particular website falls in an objectionable category.
  • the rating application can detect whether the content falls within an objectionable category and whether advertisers, promoters, or other entities associated with the content fall within an objectionable category.
  • FIG. 8 shows an illustrative example in which incoming URLs are matched to the sitelet with the longest available shared prefix. The aggregated ordinomials and associated rating of this longest prefix are then used for the query URL. Radix trees can, in some embodiments, be used to make this query computationally efficient.
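A minimal sketch of the longest-shared-prefix lookup follows; a linear scan over a dict stands in for the radix tree the paragraph above suggests for efficiency, and the example URLs are hypothetical.

```python
def match_sitelet(url, sitelet_ratings):
    """Return the rating of the sitelet whose stored URL prefix is the
    longest prefix of the query URL, or None when nothing matches."""
    best_prefix = None
    for prefix in sitelet_ratings:
        if url.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return sitelet_ratings[best_prefix] if best_prefix is not None else None
```

A query for a page under a forum subtree would thus inherit the forum sitelet's rating rather than the whole domain's.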
  • a rating for every URL or sub-string in the file tree implied by a domain's URLs need not be stored explicitly. If the rating for a page or sub-tree is not significantly different from that of its parents, then explicit storage offers little additional benefit at the expense of increased storage and computation.
  • a sensitivity parameter or threshold can be used to decide whether a rating is stored explicitly, where R(•) denotes the rating for an entity, c denotes the child page or subtree whose rating is under consideration, and p denotes the parent of child page c.
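One plausible reading of the threshold test implied by these definitions is sketched below; treating the sensitivity threshold as a plain numeric difference on ratings is an assumption, and the function name is illustrative.

```python
def store_explicitly(child_rating, parent_rating, threshold):
    """Store the child's rating R(c) explicitly only when it differs
    from its parent's rating R(p) by more than the sensitivity
    threshold; otherwise the parent's rating serves for the child."""
    return abs(child_rating - parent_rating) > threshold
```

This captures the storage-saving idea above: a child whose rating closely tracks its parent's adds little information and need not be stored.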
  • sitelet ratings can be generated from sitelet ordinomials.
  • the sitelet ordinomials can be produced by an aggregation process over the pages in the sitelet.
  • the sitelet ordinomial can be a weighted combination of the page ordinomials, a Bayesian combination, or generated using any suitable explicit mathematical function.
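The weighted-combination option can be sketched directly; Bayesian combination is the alternative the bullet above mentions, and uniform default weights are an illustrative choice.

```python
def sitelet_ordinomial(page_ordinomials, page_weights=None):
    """Weighted average of per-page ordinomials into one sitelet
    ordinomial; uniform weights by default."""
    if page_weights is None:
        page_weights = [1.0] * len(page_ordinomials)
    total = sum(page_weights)
    n_bins = len(page_ordinomials[0])
    return [
        sum(w * o[i] for w, o in zip(page_weights, page_ordinomials)) / total
        for i in range(n_bins)
    ]
```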
  • FIG. 9 shows an illustrative example of calculating sitelet ordinomials in accordance with some embodiments of the disclosed subject matter.
  • the pages in the sitelet can be considered as a large set, or the tree structure can be taken into account explicitly. In the latter case, the calculation can be done efficiently by recursion.
  • the base step is to calculate the rating at the root node. Then, for each step, the ratings for all the children are calculated. For each child, the inequality comparing its rating to that of its parent is evaluated.
  • sitelet ordinomials can be efficiently calculated using a map reduction process in accordance with some embodiments of the disclosed subject matter.
  • the rating application can generate ratings using a single pass via MapReduce or any other suitable mapping approach.
  • the reduction phase is performed using the domain as a key. Once the URLs belonging to a domain are assembled together, a file tree or domain tree can be generated, and the above-mentioned calculation of sitelet ordinomials can be used to find pertinent ratings in a domain.
  • FIG. 11 shows that sitelet ordinomials can be efficiently calculated using a map reduction process for settings with larger domains.
  • the reduction via MapReduce can occur iteratively.
  • let M denote the number of suffixes in the largest domain.
  • for each level t from M to 1, all ordinomials at level t are combined in accordance with the inequality
  • if the inequality returns false, all children are stored for rating and sitelet computation at higher levels. This may lead to an unacceptable demand for memory and resources.
  • children with the same or very similar ratings can be combined using explicit combination functions, for example, Bayesian or weighted averaging.
  • Those children that have a difference in rating of at least the sensitivity threshold are stored explicitly as their own sitelet rating. Each step reduces t by one: t ← t − 1. This is repeated until t equals 1, where the rating and sitelet are calculated and stored to ensure all URLs present in a domain receive some rating.
  • the rating application can calculate ratings using both temporal aggregation and sitelet aggregation. Generally speaking, the rating application accomplishes this by performing the temporal aggregation on URLs at the first step of sitelet aggregation. For example, as shown in FIG. 12 , the rating application can aggregate posterior ordinomials over all times (t); a reduction phase is performed using the domain as a key; and, once the URLs belonging to a domain are assembled together, a file tree or domain tree can be generated. The expected ordinomial for each URL can then be calculated.
  • mechanisms are provided for evaluating the quality of collections of online media and other suitable content. Because online media is often purchased by advertisers at different levels of granularity (e.g., ranging from individual pages to large sets of domains), it is desirable to develop metrics for comparing the quality of such diverse sets of content. More particularly, these mechanisms, among other things, collect individual content ratings, aggregate these ratings across arbitrary subsets, normalize these ratings to be on a general index scale, and calibrate the normalized ratings such that the global mean provides a benchmark for comparison.
  • the application calculates several metrics for particular content (e.g., media, web pages, etc.). For example, in the case of objectionable content, a category can have metrics encapsulating the risk related to the appearance of adult content, metrics encapsulating the risk related to the appearance or use of hate speech, etc. Accordingly, in some embodiments, the application can provide a single metric encapsulating the different aspects of the content.
  • x j refers to an individual example of a piece of online media or online content, such as a particular web page, video, or image.
  • the multiple risk ratings can be combined into a single concise metric, r(x j ), using, for example, a specialized combination function, h, such that:
  • r(x j ) = h(r (1) (x j ), . . . , r (M) (x j ))
  • example combination functions include weighted averaging, where the weights are set to the importance of particular objectionable content categories, Bayesian mixing, a secondary combining model, and/or a simple minimum function that determines the most risky category in the case of a brand safety model.
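For instance, the minimum-function variant of h for brand safety reduces to a one-liner: the combined metric is driven entirely by the riskiest (lowest-rated) category. The function name and example categories are illustrative.

```python
def brand_safety_metric(category_ratings):
    """h as a simple minimum over per-category risk ratings, so the
    most risky category determines the combined metric r(x_j)."""
    return min(category_ratings.values())
```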
  • multiple combining functions can also be used and aggregated to create the single concise metric.
  • the single concise metric can, for example, be used to compare diverse sets of content.
  • the application can allow the advertiser to compare the content managed by two different advertising networks.
  • r(•) can be ordinal, where r(•) ∈ {V 0 , . . . , V d }, such that, without loss of generality, V 0 < V 1 < . . . < V d .
  • the ratings r(•) can also be real-valued, where r(•) ∈ ℝ.
  • r(x j ) can provide a measure that includes both the quality (or severity) of x j , and the confidence that x j deserves that level of quality. That is, the rating application can provide a rating r(•) that combines both the likelihood and the severity of content considered.
  • online media is often packaged into arbitrary collections when being traded in the online advertising marketplace. Additionally, natural boundaries may exist, segregating a collection of content into distinct subsets. Given a rating defined on individual examples in this content space, r(•), it can be desirable to combine the ratings on individual pages into aggregate ratings denoting the expected rating of an entire subset of content.
  • let X denote a collection of media, for example, the media holdings of an online publisher, a particular category of web pages (such as pages related to sports), or the pages offered by a supply-side advertising network, including any subsets thereof.
  • the rating application can aggregate the ratings of content in this collection, x ∈ X.
  • δ(•) is an indicator function that takes the value 1 when the operand is true, and zero otherwise. This corresponds to the most common ordinal value in the collection. It should also be noted that ties may be broken arbitrarily, for example, by choosing the most severe category in the tie, for safety.
  • r agg = (1/|X|) Σ x ∈ X r(x).
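Both aggregations can be sketched briefly. For the ordinal mode, ties are broken toward the most severe category, as noted above; here severity_order is assumed to list categories from most to least severe.

```python
from collections import Counter

def aggregate_ordinal(ratings, severity_order):
    """Most common ordinal value in the collection, with ties broken
    by choosing the most severe tied category."""
    counts = Counter(ratings)
    best = max(counts.values())
    tied = [v for v, c in counts.items() if c == best]
    return min(tied, key=severity_order.index)

def aggregate_real(ratings):
    """Real-valued case: r_agg = (1/|X|) * sum over x of r(x)."""
    return sum(ratings) / len(ratings)
```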
  • when aggregating content ratings, the rating application considers that content may be presented in a pre-aggregated form.
  • the input may be domains, each with an aggregate rating.
  • let Y l be a collection of one or more examples of content, x ∈ Y l . Let X then be extended to be a collection of such collections, Y l ∈ X. Rating aggregation can then be extended to such sub-aggregations of content.
  • r agg = argmax V Σ Y l ∈ X δ(r agg (Y l ) = V),
  • r agg = (1/|X|) Σ Y l ∈ X |Y l | · r agg (Y l ),
  • the rating application takes unconstrained, real-valued ratings and projects them onto a bounded region of the number line for ease of comparison.
  • This mapping to the number value assigned to each ordinal category can be constructed to capture the risk and severity profiles of content in each respective category.
  • the rating application can be configured to define an index-scaled rating to be a numerical rating assigned to online media constrained to the range r i (x) ∈ [α, β].
  • This rating is assumed to capture both the severity and risk of appearance of online media, with r i (x j ) ≥ r i (x k ) implying that x k is at least as risky as x j : there is a greater chance of riskier content appearing on x k than on x j .
  • This implies that xj is likely to be safer for brand advertisers or other online media buyers.
  • given an index-scaled rating, r i (x), on a particular example x, the rating application defines mappings from unscaled ratings, both ordinal and real-valued, into r i (x).
  • for ordinal ratings, the index-scaled rating can be expressed as:
  • mapping to an index-scaled rating is performed by assigning a constant, a, to each ordinal non-index scaled rating.
  • a ∈ [α, β] (e.g., a is bounded by the index-scaled rating range) and, without loss of generality, a V m ≤ a V n whenever V m ≤ V n ,
  • more risky ordinal categories have lower numerical values in the mapping.
  • for real-valued ratings, the index-scaled rating can be expressed as:
  • f(•) is a monotonic function. For example, f(r(x j )) ≤ f(r(x k )) whenever r(x j ) ≤ r(x k ). That is, lower unscaled ratings tend to get lower scaled ratings. Additionally, it should be noted that the range of f(•) is [α, β].
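Both mappings can be sketched together. The per-category constants and the logistic choice of f below are illustrative assumptions; any monotonic f whose range is the index-scale interval would satisfy the description above.

```python
import math

def index_scale_ordinal(rating, constants):
    """Ordinal case: look up the constant a_V assigned to each ordinal
    category; riskier categories should carry lower constants."""
    return constants[rating]

def index_scale_real(raw_rating, alpha=0.0, beta=100.0):
    """Real-valued case: a monotonic f squashing an unbounded raw
    rating into [alpha, beta] (logistic, as one illustrative choice)."""
    return alpha + (beta - alpha) / (1.0 + math.exp(-raw_rating))
```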
  • the rating application can transform arbitrary raw ratings into an index-scaled rating.
  • a numerical rating can encode the likelihood of encountering risky or inappropriate content on a given example of online media, in addition to the likely severity of such content.
  • the resulting index-scaled rating represents the value of online content to buyers and advertisers, with risky and severely inappropriate content generally being of low value.
  • the rating application can be configured to aggregate ratings for collected content, x ∈ X, with commensurate impact of riskier individual pages. This can be represented as follows:
  • r i,agg = Σ x ∈ X r i (x) / Σ x ∈ X w(r i (x))
  • w(•) ∈ [1, ∞) represents a weight function associated with a content rating. More particularly, content that is riskier receives a lower numerical rating and contributes a higher weight, thereby lowering the expected score via a larger denominator.
  • the rating application creates four risk buckets—e.g., very high risk, high risk, moderate risk, and low risk, each with ranges of an index-scaled rating. For a given aggregation of content, the rating application also denotes the number of examples in each by r 1 , r 2 , r 3 , and r 4 , respectively.
  • the rating application can also assign a native index-scaled rating to each bucket.
  • the rating application can assign 50, 100, 150, and 200 to each bucket, respectively.
  • the rating application can provide combination weights for each category.
  • the application can assign the combination weights of 35.2, 8.8, 2.2, and 1.0 for each bucket, respectively. Accordingly, a severity-weighted aggregation of such content can be determined by calculating:
  • r i,agg = (50·r 1 + 100·r 2 + 150·r 3 + 200·r 4 ) / (35.2·r 1 + 8.8·r 2 + 2.2·r 3 + 1.0·r 4 )
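Plugging the illustrative native ratings and combination weights above into this formula gives a severity-weighted aggregate; counts are ordered from very high risk to low risk.

```python
def severity_weighted_aggregate(counts,
                                native_ratings=(50, 100, 150, 200),
                                weights=(35.2, 8.8, 2.2, 1.0)):
    """Severity-weighted aggregation over the four risk buckets.
    Riskier buckets carry much larger weights, so even a few
    very-high-risk pages pull the aggregate down sharply."""
    numerator = sum(r * n for r, n in zip(native_ratings, counts))
    denominator = sum(w * n for w, n in zip(weights, counts))
    return numerator / denominator
```

For example, a collection of ten low-risk pages scores 200, while adding a single very-high-risk page drops the aggregate to roughly 45.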
  • the rating application not only considers how content rates with respect to risk and severity, but also determines how that content compares to other similar content. In order to perform such a comparison, the rating application recalibrates ratings to the mean rating of the content being considered.
  • the mean ( ⁇ r ) of the uncalibrated set of ratings can be determined by calculating:
  • μ r = (1/|X|) Σ x ∈ X r i (x)
  • gamma (γ) can denote the value to which the mean is mapped after calibration, and Y j can denote a subset of content in X.
  • the rating application then defines a calibration of Y j 's rating, r c , relative to μ r using the following cases:
  • the re-calibration can be performed by determining:
  • FIG. 13 is a generalized schematic diagram of a system 1300 on which the rating application may be implemented in accordance with some embodiments of the disclosed subject matter.
  • system 1300 may include one or more user computers 1302 .
  • User computers 1302 may be local to each other or remote from each other.
  • User computers 1302 are connected by one or more communications links 1304 to a communications network 1306 that is linked via a communications link 1308 to a server 1310 .
  • System 1300 may include one or more servers 1310 .
  • Server 1310 may be any suitable server for providing access to the application, such as a processor, a computer, a data processing device, or a combination of such devices.
  • the application can be distributed into multiple backend components and multiple frontend components or interfaces.
  • backend components such as data collection and data distribution can be performed on one or more servers 1310 .
  • the graphical user interfaces displayed by the application such as a data interface and an advertising network interface, can be distributed by one or more servers 1310 to user computer 1302 .
  • each of the client 1302 and server 1310 can be any of a general purpose device such as a computer or a special purpose device such as a client, a server, etc.
  • Any of these general or special purpose devices can include any suitable components such as a processor (which can be a microprocessor, digital signal processor, a controller, etc.), memory, communication interfaces, display controllers, input devices, etc.
  • client 1302 can be implemented as a personal computer, a personal data assistant (PDA), a portable email device, a multimedia terminal, a mobile telephone, a set-top box, a television, etc.
  • any suitable computer readable media can be used for storing instructions for performing the processes described herein, can be used as a content distribution that stores content and a payload, etc.
  • computer readable media can be transitory or non-transitory.
  • non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as flash memory, electrically programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), etc.), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media.
  • transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
  • communications network 1306 may be any suitable computer network including the Internet, an intranet, a wide-area network (“WAN”), a local-area network (“LAN”), a wireless network, a digital subscriber line (“DSL”) network, a frame relay network, an asynchronous transfer mode (“ATM”) network, a virtual private network (“VPN”), or any combination of any of such networks.
  • Communications links 1304 and 1308 may be any communications links suitable for communicating data between user computers 1302 and server 1310 , such as network links, dial-up links, wireless links, hard-wired links, any other suitable communications links, or a combination of such links.
  • User computers 1302 enable a user to access features of the application.
  • User computers 1302 may be personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (“PDAs”), two-way pagers, wireless terminals, portable telephones, any other suitable access device, or any combination of such devices.
  • User computers 1302 and server 1310 may be located at any suitable location. In one embodiment, user computers 1302 and server 1310 may be located within an organization. Alternatively, user computers 1302 and server 1310 may be distributed between multiple organizations.
  • user computer 1302 may include processor 1402 , display 1404 , input device 1406 , and memory 1408 , which may be interconnected.
  • memory 1408 contains a storage device for storing a computer program for controlling processor 1402 .
  • Processor 1402 uses the computer program to present on display 1404 the application and the data received through communications link 1304 and commands and values transmitted by a user of user computer 1302 . It should also be noted that data received through communications link 1304 or any other communications links may be received from any suitable source.
  • Input device 1406 may be a computer keyboard, a cursor-controller, dial, switchbank, lever, or any other suitable input device as would be used by a designer of input systems or process control systems.
  • Server 1310 may include processor 1420 , display 1422 , input device 1424 , and memory 1426 , which may be interconnected.
  • memory 1426 contains a storage device for storing data received through communications link 1308 or through other links, and also receives commands and values transmitted by one or more users.
  • the storage device further contains a server program for controlling processor 1420 .
  • the application may include an application program interface (not shown), or alternatively, the application may be resident in the memory of user computer 1302 or server 1310 .
  • the only distribution to user computer 1302 may be a graphical user interface (“GUI”) which allows a user to interact with the application resident at, for example, server 1310 .
  • the application may include client-side software, hardware, or both.
  • the application may encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as HyperText Markup Language (“HTML”), Dynamic HyperText Markup Language (“DHTML”), Extensible Markup Language (“XML”), JavaServer Pages (“JSP”), Active Server Pages (“ASP”), Cold Fusion, or any other suitable approaches).
  • the application is described herein as being implemented on a user computer and/or server, this is only illustrative.
  • the application may be implemented on any suitable platform (e.g., a personal computer (“PC”), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, an H/PC, an automobile PC, a laptop computer, a cellular phone, a personal digital assistant (“PDA”), a combined cellular phone and PDA, etc.) to provide such features.
  • a procedure is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. These steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
  • the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein which form part of the present invention; the operations are machine operations.
  • Useful machines for performing the operation of the present invention include general purpose digital computers or similar devices.
  • the present invention also relates to apparatus for performing these operations.
  • This apparatus may be specially constructed for the required purpose or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer.
  • the procedures presented herein are not inherently related to a particular computer or other apparatus.
  • Various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove more convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given.

Abstract

Systems, methods, and media for rating websites for safe advertising are provided. In accordance with some embodiments of the disclosed subject matter, the method comprises: extracting one or more features from a piece of web content; applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determining a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generating a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and providing the risk rating for determining whether an advertisement should be associated with the web content.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 13/151,146, filed Jun. 1, 2011, which claims the benefit of U.S. Provisional Patent Application No. 61/350,393, filed Jun. 1, 2010 and U.S. Provisional Patent Application No. 61/431,789, filed Jan. 11, 2011, which are hereby incorporated by reference herein in their entireties.
  • This application is also related to U.S. patent application Ser. No. 12/859,763, filed Aug. 19, 2010, which is hereby incorporated by reference herein in its entirety.
  • FIELD OF THE INVENTION
  • The disclosed subject matter generally relates to methods, systems, and media for applying scores and ratings to web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising.
  • BACKGROUND OF THE INVENTION
  • Brands are carefully crafted and incorporate a firm's image as well as a promise to the firm's stakeholders. Unfortunately, in the current online environment, advertising networks may juxtapose advertisements that represent such brands with undesirable content due to the opacity of the ad-placement process and possibly to a misalignment of incentives in the ad-serving ecosystem. Currently, neither the ad network nor the brand can efficiently recognize whether a website contains or has a tendency to contain questionable content.
  • Online advertisers use tools that provide information about websites or publishers and the viewers of such websites to facilitate more effective planning and management of online advertising by advertisers. Moreover, online advertisers continually desire increased control over the web pages on which their advertisements and brand messages appear. For example, particular online advertisers want to control the risk that their advertisements and brand messages appear on pages or sites that contain objectionable content (e.g., pornography or adult content, hate speech, bombs, guns, ammunition, alcohol, offensive language, tobacco, spyware, malicious code, illegal drugs, music downloading, particular types of entertainment, illegality, obscenity, etc.). In another example, advertisers for adult-oriented products, such as alcohol and tobacco, want to avoid pages directed towards children. In yet another example, particular online advertisers want to increase the probability that their content appears on specific sorts of sites (e.g., websites containing news-related information, websites containing entertainment-related information, etc.). However, current advertising tools merely categorize websites into categories indicating that a web site contains a certain sort of content.
  • There is therefore a need in the art for approaches for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising. Accordingly, it is desirable to provide methods, systems, and media that overcome these and other deficiencies of the prior art.
  • For example, the disclosed subject matter provides advertisers, agencies, advertisement networks, advertisement exchanges, and publishers with the ability to make risk-controlled decisions based on the category-specific risk and/or general risk associated with a given web page, website, etc. In a more particular example, advertisers, agencies, advertisement networks, advertisement exchanges, and publishers can determine whether to place a particular advertisement on a particular web page based on a high confidence that the page does not contain objectionable content. In another more particular example, advertisers, agencies, advertisement networks, advertisement exchanges, and publishers can request to view a list of pages in their current advertisement network traffic assessed to have the highest risk of objectionable content.
  • SUMMARY OF THE INVENTION
  • In accordance with various embodiments of the disclosed subject matter, mechanisms for scoring and rating web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising are provided.
  • These mechanisms, among other things, generate a risk rating that accounts for the inclusion of objectionable content with the use of ordinomials. The risk rating can, in some embodiments, represent the probability that a page or a site contains or will contain objectionable content, the degree of objectionability of the content, and/or any suitable combination thereof. In a more particular example, the risk rating can be determined for a single domain and/or a single category such that a particular piece of media or content can have a rating for each of a number of objectionable content categories. Alternatively, in another more particular example, the risk rating can be determined across several objectionable content categories, across multiple pieces of content (e.g., the pages appearing in the advertiser's traffic), and/or across multiple domains managed by a publisher.
  • In some embodiments, these mechanisms can be generated using multiple statistical models and considering multiple pieces of evidence. In some embodiments, these mechanisms can account for temporal dynamics in content by determining a risk rating that is based on the probability of encountering different severity levels from a given URL and that is based on the types of estimated severity exhibited in the past.
  • In some embodiments, these mechanisms can evaluate the quality of collections of content. More particularly, these mechanisms can collect individual content ratings (e.g., ordinal ratings and/or real-valued ratings), aggregate these ratings across arbitrary subsets, normalize these ordinal and real-valued ratings onto a general index scale, and calibrate and/or map the normalized ratings using a global mean to provide a benchmark for comparison. This mapping can capture the risk and/or severity profiles of appearance of content.
  • Systems, methods, and media for rating websites for safe advertising are provided. In accordance with some embodiments of the disclosed subject matter, the method comprises: extracting one or more features from a piece of web content; applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determining a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generating a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and providing the risk rating for determining whether an advertisement should be associated with the web content.
  • In some embodiments, the method further comprises: determining a plurality of posterior ordinomial estimates at a plurality of times for the web content; and determining an expected posterior ordinomial estimate by combining the plurality of posterior ordinomial estimates over the plurality of times.
  • In some embodiments, the method further comprises: extracting a uniform resource locator from the one or more features; assembling a first set of posterior ordinomial estimates from the plurality of posterior ordinomial estimates based on the uniform resource locator; and determining the expected posterior ordinomial estimate by combining the first set of posterior ordinomial estimates over the plurality of times.
  • In some embodiments, the method further comprises: determining that the web content belongs to a sitelet, wherein the sitelet includes a plurality of web pages; determining a sitelet ordinomial by aggregating the plurality of posterior ordinomial estimates associated with each of the plurality of web pages; and generating a sitelet rating based on the aggregated plurality of posterior ordinomials.
  • In some embodiments, the method further comprises: comparing the sitelet ordinomial with the plurality of posterior ordinomial estimates associated with each of the plurality of web pages belonging to the sitelet; and determining whether to store at least one of the sitelet ordinomial and the plurality of posterior ordinomial estimates based on the comparison and a sensitivity value.
  • In some embodiments, the method further comprises: collecting a plurality of ratings associated with a plurality of pieces of web content, wherein the plurality of ratings includes ordinal ratings and real-valued ratings; and determining an aggregate rating for the plurality of pieces of web content based on the collected plurality of ratings.
  • In some embodiments, the method further comprises normalizing the aggregate rating by mapping the aggregate rating to an index-scaled rating.
  • In some embodiments, the method further comprises: applying a severity weight to the index-scaled rating; and generating a severity-weighted index-scaled rating for the plurality of pieces of web content.
  • In some embodiments, the method further comprises generating a combined risk rating by combining the generated risk rating that encodes whether the web content is likely to contain objectionable content of the given category with a second risk rating that encodes whether the web content is likely to contain objectionable content of a second category.
  • In some embodiments, a system for rating webpages for safe advertising is provided, the system comprising a processor that: extracts one or more features from a piece of web content; applies a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determines a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generates a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and provides the risk rating for determining whether an advertisement should be associated with the web content.
  • In some embodiments, a non-transitory computer-readable medium is provided containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for rating webpages for safe advertising, the method comprising: extracting one or more features from a piece of web content; applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web content is a member of one of a plurality of severity groups; determining a posterior ordinomial estimate for the web content by combining the plurality of ordinomial estimates; generating a risk rating that encodes severity and confidence based on the determined posterior ordinomial estimate, wherein the risk rating identifies whether the web content is likely to contain objectionable content of a given category; and providing the risk rating for determining whether an advertisement should be associated with the web content.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Various objects, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description of the invention when considered in connection with the following drawings, in which like reference numerals identify like elements.
  • FIG. 1 is a diagram of an illustrative example of a process for determining the probability of membership in a severity group for a category of objectionable content in accordance with some embodiments of the disclosed subject matter.
  • FIG. 2 is a diagram of an illustrative example of combining ordinomial estimates into a posterior ordinomial estimate in accordance with some embodiments of the disclosed subject matter.
  • FIG. 3 is an illustrative example of temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 4 is an illustrative example of the map reduction approach (MapReduce) for determining the temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 5 is a diagram of an illustrative example of a process for generating one or more ratings for a webpage in accordance with some embodiments of the disclosed subject matter.
  • FIG. 6 is a diagram of a graph showing the selection of an appropriate bin (bi) in an ordinomial given a confidence parameter (β) in accordance with some embodiments of the disclosed subject matter.
  • FIG. 7 is a diagram of an illustrative rating scale in accordance with some embodiments of the disclosed subject matter.
  • FIG. 8 is an illustrative example of matching incoming URLs to the sitelet with the longest available shared prefix in accordance with some embodiments of the disclosed subject matter.
  • FIG. 9 is an illustrative example of calculating sitelet ordinomials in accordance with some embodiments of the disclosed subject matter.
  • FIG. 10 is an illustrative example of calculating sitelet ordinomials and sitelet ratings in settings with small domains in accordance with some embodiments of the disclosed subject matter.
  • FIG. 11 is an illustrative example of calculating sitelet ordinomials and sitelet ratings in settings with larger domains in accordance with some embodiments of the disclosed subject matter.
  • FIG. 12 is an illustrative example of using the rating application to calculate ratings using both temporal aggregation and sitelet aggregation in accordance with some embodiments of the disclosed subject matter.
  • FIG. 13 is a diagram of an illustrative system on which a rating application can be implemented in accordance with some embodiments of the disclosed subject matter.
  • FIG. 14 is a diagram of an illustrative user computer and server as provided, for example, in FIG. 13 in accordance with some embodiments of the disclosed subject matter.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In accordance with some embodiments of the disclosed subject matter, mechanisms for scoring and rating web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising are provided. These mechanisms, among other things, generate a risk rating that accounts for the inclusion of objectionable content with the use of ordinomials. The risk rating can, in some embodiments, represent the probability that a page or a site contains or will contain objectionable content, the degree of objectionability of the content, and/or any suitable combination thereof. In a more particular example, the risk rating can be determined for a single domain and/or a single category such that a particular piece of media or content can have a rating for each of a number of objectionable content categories. Alternatively, in another more particular example, the risk rating can be determined across several objectionable content categories, across multiple pieces of content (e.g., the pages appearing in the advertiser's traffic), and/or across multiple domains managed by a publisher.
  • In some embodiments, these mechanisms can be generated using multiple statistical models and considering multiple pieces of evidence. In some embodiments, these mechanisms can account for temporal dynamics in content by determining a risk rating that is based on the probability of encountering different severity levels from a given URL and that is based on the types of estimated severity exhibited in the past.
  • These mechanisms can be used in a variety of applications. For example, these mechanisms can provide a rating application that allows advertisers, ad networks, publishers, site managers, and/or other entities to make risk-controlled decisions based at least in part on risk associated with a given webpage, website, or any other suitable content (generally referred to herein as a “webpage” or “page”). In another example, these mechanisms can provide a rating application that allows advertisers, agencies, advertisement networks, advertisement exchanges, and/or publishers to determine whether to place a particular advertisement on a particular web page based on a high confidence that the page does not contain objectionable content. In a more particular example, these mechanisms allow an advertiser to designate that an advertisement should not be placed on a web page unless a particular confidence (e.g., high confidence, medium-high confidence, etc.) is achieved. In such an example, the particular confidence may be determined based on having a severity greater than a particular severity group in a particular category. In another example, advertisers, agencies, advertisement networks, advertisement exchanges, and publishers can request to view a list of pages in their current advertisement network traffic assessed to have the highest risk of objectionable content.
  • It should be noted that there can be several categories of objectionable content that may be of interest. For example, these categories can include content that relates to guns, bombs, and/or ammunition (e.g., sites that describe or provide information on weapons including guns, rifles, bombs, and ammunition, sites that display and/or discuss how to obtain weapons, manufacture of weapons, trading of weapons (whether legal or illegal), sites that describe or offer for sale weapons including guns, ammunition, and/or firearm accessories, etc.). In another example, these categories can include content relating to alcohol (e.g., sites that provide information relating to alcohol, sites that provide recipes for mixing drinks, sites that provide reviews and locations for bars, etc.), drugs (e.g., sites that provide instructions for or information about obtaining, manufacturing, or using illegal drugs), and/or tobacco (e.g., sites that provide information relating to smoking, cigarettes, chewing tobacco, pipes, etc.). In yet another example, these categories can include offensive language (e.g., sites that contain swear words, profanity, harsh language, inappropriate phrases and/or expressions), hate speech (e.g., sites that advocate hostility or aggression towards individuals or groups on the basis of race, religion, gender, nationality, or ethnic origin, sites that denigrate others or justify inequality, sites that purport to use scientific or other approaches to justify aggression, hostility, or denigration), and/or obscenities (e.g., sites that display graphic violence, the infliction of pain, gross violence, and/or other types of excessive violence). In another example, these categories can include adult content (e.g., sites that contain nudity, sex, use of sexual language, sexual references, sexual images, and/or sexual themes).
In another example, these categories can include spyware or malicious code (e.g., sites that provide instructions to practice illegal or unauthorized acts of computer crime using technology or computer programming skills, sites that contain malicious code, etc.) or other illegal content (e.g., sites that provide instructions for threatening or violating the security of property or the privacy of others, such as theft-related sites, lock-picking and burglary-related sites, and fraud-related sites).
  • It should be noted that objectionable content on one or more of these webpages can generally be defined as having a severity level worse than (or greater than) b_j in a category y. Each category (y) can include various severity groups b_j, where j ranges from 1 to n and n is an integer greater than one. For example, an adult content category can have various severity levels, such as G, PG, PG-13, R, NC-17, and X. In another example, an adult content category and an offensive speech category can be combined to form one category of interest. In yet another example, unlike the adult content category example, a category may not have fine-grained severity groups and a binomial distribution can be used. For example, a binomial probability can be used for binary-outcome events, where there is typically one positive event (e.g., good, yes, etc.) and one negative event (e.g., bad, no, etc.).
  • FIG. 1 is a diagram showing an example of a process for determining the probability of membership in a severity group for one or more category of objectionable content in accordance with some embodiments of the disclosed subject matter. As shown in FIG. 1, process 100 begins by receiving or reviewing content on a webpage, website, or any other suitable content (generally referred to herein as a “webpage” or “page”) at 110. For example, in some embodiments, a rating application can receive multiple requests to rate a group of webpages or websites. In another example, a rating application can receive, from an advertiser, a list of websites that the advertiser is interested in placing an advertisement provided that each of these websites does not contain or does not have a high likelihood of containing objectionable content. In yet another example, a rating application can receive, from an advertiser, that advertiser's current advertisement network traffic for assessment.
  • In response to receiving one or more webpages, the rating application or a component of the rating application selects a uniform resource locator (URL) for rating at 120. For example, the rating application can receive one or more requests from other components (e.g., the most popular requests are assigned a higher priority, particular components of the rating application are assigned a higher priority, or random selection from the requests). In yet another example, a fixed, prioritized list of URLs can be defined based, for example, on ad traffic or any other suitable input (e.g., use of the rating for scoring, use of the rating for active learning, etc.).
  • One or more pieces of evidence can be extracted from the uniform resource locator or page at 130. These pieces of evidence can include, for example, the text of the URL, image analysis, HyperText Markup Language (HTML) source code, site or domain registration information, ratings, categories, and/or labeling from partner or third party analysis systems (e.g., site content categories), source information of the images on the page, page text or any other suitable semantic analysis of the page content, metadata associated with the page, anchor text on other pages that point to the page of interest, ad network links and advertiser information taken from a page, hyperlink information, malicious code and spyware databases, site traffic volume data, micro-outsourced data, any suitable auxiliary derived information (e.g., ad-to-content ratio), and/or any other suitable combination thereof. As described herein, evidence and/or any other suitable information relating to the page can be collected, extracted, and/or derived using one or more evidentiary sources.
  • Approaches for collecting and analyzing various pieces of evidence for generating a risk rating are further described in, for example, above-referenced U.S. patent application Ser. No. 12/859,763, filed Aug. 19, 2010, which is hereby incorporated by reference herein in its entirety.
  • To encode the probability of membership in severity group bj, an ordinomial can be generated at 140. For example, a multi-severity classification can be determined by using an ordinomial to encode the probability of membership in an ordered set of one or more severity groups. The ordinomial can be represented as follows:

  • for each j ∈ [0, J]: p(y = b_j | x)
  • where y is a variable representing the severity class that page x belongs to. It should be noted that the ordinal nature implies that b_i is less severe than b_j when i < j. It should also be noted that ordinomial probabilities can be estimated using any suitable statistical models, such as the ones described herein, and using the evidence derived from the pages.
  • At 150, an ordinomial distribution that includes each generated ordinomial for one or more severity groups can be generated. Accordingly, the cumulative ordinal distribution F can be described as:

  • F(y = b_j | x) = Σ_{i=1}^{j} p(y = b_i | x)
  • Alternatively, unlike the adult content category example described above, a category may not have fine-grained severity groups and a binomial distribution can be used. For example, a binomial probability can be used for binary-outcome events, where there is typically one positive event (e.g., good, yes, etc.) and one negative event (e.g., bad, no, etc.). At 160, in some embodiments, a binary or binomial-probability determination of appropriateness or objectionability can be projected onto an ordinomial by considering the extreme classes, b_1 and b_n. For example, in cases where a large spectrum of severity groups may not be present, such as malware, a binomial determination can be performed, where the extreme classes include one positive class (e.g., malware is present in the content) and one negative class (e.g., malware is not present in the content). Ordinomial probabilities can be estimated using one or more statistical models, for example, from evidence derived or extracted from the received web pages.
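The ordinomial, its cumulative ordinal distribution F, and the projection of a binomial determination onto the extreme classes can be sketched as follows (function names, category labels, and probabilities are illustrative assumptions, not part of the disclosure):

```python
def cumulative(ordinomial):
    """Cumulative ordinal distribution: F(y = b_j | x) = sum over i <= j of p(y = b_i | x)."""
    total, out = 0.0, []
    for p in ordinomial:
        total += p
        out.append(total)
    return out

# Illustrative adult-content ordinomial over severity groups b_1..b_6
# (least severe to most severe, e.g., G through X); entries sum to 1.
p = [0.60, 0.20, 0.10, 0.06, 0.03, 0.01]
F = cumulative(p)   # F[2] is the probability the content is no worse than the third group

def binomial_to_ordinomial(p_negative, n):
    """Project a binary judgment (e.g., malware absent/present) onto an
    n-bin ordinomial using only the extreme classes b_1 and b_n."""
    v = [0.0] * n
    v[0], v[-1] = p_negative, 1.0 - p_negative
    return v
```

The binomial projection leaves all interior bins at zero probability, matching the text's observation that categories such as malware lack a fine-grained severity spectrum.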
  • It should be noted that, in process 100 of FIG. 1 and other processes described herein, some steps can be added, some steps may be omitted, the order of the steps may be rearranged, and/or some steps may be performed simultaneously.
  • In some embodiments, multiple ordinomials can be generated from a variety of different statistical models based on a diverse range of evidence. For example, different pieces of evidence can be accounted for in the determination of an ordinomial. These ordinomial estimates can be combined into a posterior ordinomial estimate using, for example, ensemble approaches and information fusion approaches. In a more particular example, aggregation approaches include weighted averaging, AdaBoost-type mixing, or using sub-ordinomials as covariates in a secondary model. Accordingly, as shown in FIG. 2, this can be represented as:

  • p(y = b_i | x) = f(p_0(y = b_i | x), ..., p_m(y = b_i | x))
  • for predictive models 0 through m.
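One of the aggregation approaches named above, weighted averaging, can be sketched as follows (the model outputs and weights are illustrative assumptions):

```python
def combine_ordinomials(estimates, weights):
    """Weighted average of m model ordinomials, renormalized to sum to 1."""
    n = len(estimates[0])
    mixed = [sum(w * e[i] for e, w in zip(estimates, weights)) for i in range(n)]
    z = sum(mixed)
    return [v / z for v in mixed]

model_a = [0.7, 0.2, 0.1]   # e.g., an estimate from a text-based model
model_b = [0.5, 0.3, 0.2]   # e.g., an estimate from an image-based model
posterior = combine_ordinomials([model_a, model_b], [0.6, 0.4])
```

More sophisticated choices of f, such as AdaBoost-type mixing or a secondary model over sub-ordinomials, would replace the weighted average while keeping the same interface.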
  • It should also be noted that, as web pages change over time, the rating application can account for such temporal dynamics. With the dynamics of web pages, subsequent estimates of posterior ordinomials can provide different results. The rating application accounts for these temporal dynamics, where the output can be based on the probability of encountering different severity levels from a given URL based on the type of estimated severity exhibited in the past. Given p(y = b_i | x), the posterior ordinomial at time t_v can be estimated using, for example, Bayesian combination, ensemble-modeling techniques, exponential discounting over time, conditional random fields, hidden Markov models, and/or any other techniques that explicitly account for time differences. More particularly, one suitable technique can assign different weights based on data age.
  • In some embodiments, the rating application can provide temporal aggregation features to account for the change to web pages over time. FIG. 3 provides an illustrative example of temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter. As shown, temporal aggregation can be implemented in an efficient and distributed manner using a map reduce paradigm, where the key of reduction is the URL being considered. The posterior ordinomials for all times (t) are aggregated and a final p(y=bi|x) is calculated using the aggregated ordinomials and output.
  • FIG. 4 shows an illustrative example of the map reduction approach (MapReduce) for determining the temporal aggregation of posterior ordinomials in accordance with some embodiments of the disclosed subject matter. As shown, URLs can be used as the key for the reduction phase of the MapReduce process. This has the effect of compiling all samples that belong to a given domain onto a single computer during the reduction. Along with the URL, the ordinomial probabilities and the timestamp denoting the instant the ordinomial probability sample was made are passed. More particularly, as shown in FIG. 4, the posterior ordinomials for a given domain can be sorted based on the timestamp or observation time. Probability estimates can then be performed, where the sorted posterior ordinomials for a given domain are combined and an expected posterior ordinomial is calculated. Depending on the computational nature of the temporal aggregation, this expected ordinomial can be stored for use in future temporal aggregations, thereby alleviating the need for explicit storage of each individual record. Additionally, the reduction phase of this MapReduce process can compute and output a rating as described herein.
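The reduce phase can be sketched as follows, using exponential discounting over time as the combination technique; the half-life parameter, sample data, and function names are illustrative assumptions rather than values from the disclosure:

```python
from collections import defaultdict

def reduce_by_url(samples, now, half_life=30.0):
    """Group (url, timestamp, ordinomial) samples by URL, sort each group
    by observation time, and combine with exponential discounting so that
    older observations carry less weight."""
    by_url = defaultdict(list)
    for url, t, ordinomial in samples:
        by_url[url].append((t, ordinomial))
    expected = {}
    for url, obs in by_url.items():
        obs.sort(key=lambda pair: pair[0])                    # sort by timestamp
        weights = [0.5 ** ((now - t) / half_life) for t, _ in obs]
        z = sum(weights)
        n = len(obs[0][1])
        expected[url] = [
            sum(w * o[i] for (_, o), w in zip(obs, weights)) / z
            for i in range(n)
        ]
    return expected

samples = [("example.com/a", 0, [0.9, 0.1]),    # older observation
           ("example.com/a", 25, [0.6, 0.4])]   # more recent observation
exp_ord = reduce_by_url(samples, now=30)
```

Because the combination is a weighted running average, the expected ordinomial per URL can be stored and updated incrementally, matching the text's note that explicit storage of each individual record can be avoided.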
  • FIG. 5 is a diagram of an example of a process 500 for generating a rating (R) for a webpage in accordance with some embodiments of the disclosed subject matter. Generally speaking, one or more ratings can be determined for a webpage and its ordinomial probability estimates that encode both severity and confidence. That is, a rating (R) associated with a particular ordinomial, p(y=bi|x) that includes severity and confidence parameters is determined. For example, an advertiser may desire that the rating represents a particular confidence that the page's content is no worse than severity group bj. Alternatively, in another example, an advertiser may desire that the rating encodes the confidence that a particular webpage is no better than a particular severity group.
  • As shown in FIG. 5, process 500 begins by selecting the worst severity in accordance with a user specified confidence parameter (β) at 510. For example, as shown in FIG. 6, starting from the least severe or objectionable category in the ordinomial (b1), the bins of the ordinomial are ascended, maintaining a sum of the probabilities encountered. The bin, bi, where the level of confidence (β) is reached can be represented by:
  • b_i = argmin_i { Σ_{j=1}^{i} p(y = b_j | x) ≥ β }
  • Accordingly, the bin b_i is selected such that the application has at least the level of confidence (β) that the content is no worse than b_i.
  • It should be noted that assigning a larger confidence parameter (β) ensures that a smaller probability mass resides in the more severe categories.
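The bin-selection step above, ascending the bins from least severe while summing probabilities until the confidence level β is reached, can be sketched as (function name and probabilities illustrative):

```python
def select_bin(ordinomial, beta):
    """Return the 1-based index i of the first bin where the cumulative
    probability sum over j <= i of p(y = b_j | x) reaches the confidence beta."""
    total = 0.0
    for i, p in enumerate(ordinomial, start=1):
        total += p
        if total >= beta:
            return i
    return len(ordinomial)   # guard against floating-point shortfall

p = [0.60, 0.20, 0.10, 0.06, 0.03, 0.01]   # bins ordered least to most severe
select_bin(p, 0.75)   # the cumulative mass reaches beta within the second bin
```

A larger β forces the scan deeper into the severe bins, which is why a higher confidence requirement yields a more conservative (more severe) bin choice.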
  • Referring back to FIG. 5, one or more ratings are generated at 520. The rating application can determine ratings from a given page's ordinomial probability estimates that encode both severity and confidence. It should be noted that the rating application can assume that ratings are given on a numeric scale that can be divided into ranges B_j, where there is a one-to-one mapping between these ranges and the b_j. That is, step 510 of process 500 indicates that there is a particular confidence that a page has severity no worse than b_j, and the rating (R) is somewhere in the range B_j. For example, as shown in FIG. 7, the rating scale 700 can be a numeric scale of the numbers 0 through 1000, where 1000 denotes the least severe end or the highly safe portion of the scale. In another example, rating scale 700 can be further divided such that particular portions of the rating scale are determined to be the best pages, e.g., ratings falling between 800 and 1000. Accordingly, if there is at least β confidence that the page's content is no worse than the best category, then the page's rating falls in the 800-1000 range.
  • Additional features of the rating scale are described further below.
  • To determine the rating (R) within the range, boundaries to the rating range (Bj) and a center (cj) of each bin are defined in the configuration of the application. For example, consider two pages A and B, where page A has 99.9% confidence that the page contains pornography and page B has a confidence of (1−β)+ε that it contains pornography. It should be noted that ε is generally an arbitrarily small number. That is, while page A contains pornography, it cannot be stated with confidence that page B does not contain pornography. Both pages A and B fall in the lowest ratings range. However, the rating application generates a significantly lower rating for page A.
  • It should be noted that, in some embodiments, interior rating ranges for a particular objectionability category can be defined. For example, the rating application can generate one or more ratings that take into account the difference between being uncertain between R rated content and PG rated content, where R and PG are two interior severity levels within the adult content category. In another example, the rating application can generate one or more ratings that take into account the difference between a page having no evidence of X rated content and a page having some small evidence of containing X rating content.
  • The boundaries of rating range B_j can be defined as s_{j-1} and s_j. In addition, a center c_j can be defined for each bin. It should be noted that the center for each bin is not necessarily the median of the range. Rather, the center is the rating the application should produce if either all of the probability resides in this range or the probabilities above and below are balanced in accordance with a given level of β assurance. Accordingly, the rating given the chosen bin b_i and the ordinomial encoding of p(y = b_j | x) can be represented by:
  • R = c_i - (c_i - s_{i-1}) · (Σ_{j=0}^{i-1} p(y = b_j | x)) / β + (s_i - c_i) · (Σ_{j=i+1}^{n} p(y = b_j | x)) / (1 - β)
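This rating computation can be transcribed directly, assuming 1-based bin indices; the boundary values s_j and centers c_j below are illustrative configuration values, not values from the disclosure:

```python
def rating(ordinomial, i, s, c, beta):
    """Rating within range B_i given chosen bin b_i (1-based index i),
    boundaries s[0..n], and per-bin centers c[0..n-1]. Probability mass in
    bins before b_i shifts the rating from c_i toward s_{i-1}; mass in bins
    after b_i shifts it toward s_i."""
    below = sum(ordinomial[:i - 1])   # mass in bins b_1 .. b_{i-1}
    above = sum(ordinomial[i:])       # mass in bins b_{i+1} .. b_n
    return (c[i - 1]
            - (c[i - 1] - s[i - 1]) * below / beta
            + (s[i] - c[i - 1]) * above / (1.0 - beta))

p = [0.10, 0.85, 0.05]
s = [0, 400, 800, 1000]    # range boundaries s_0 .. s_n (illustrative)
c = [200, 600, 900]        # configured centers c_1 .. c_n (illustrative)
r = rating(p, 2, s, c, beta=0.9)   # bin b_2 chosen at confidence beta = 0.9
```

The result always stays within the chosen range B_i, so the coarse bin conveys the β-confidence severity while the residual probability mass determines where in the range the rating lands.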
  • It should be noted that one or more ratings can be generated for one or more objectionable categories. For example, multiple ratings can be generated, where one rating is generated for each selected objectionable content category (e.g., adult content, offensive language, and alcohol).
  • It should also be noted that, in some embodiments, ratings for two or more objectionable categories can be combined to create a combined score. For example, a first rating generated for an adult content category and a second rating generated for an offensive language category can be combined. Alternatively, weights can be assigned to each category such that a higher weight can be assigned to the adult content category and a lower weight can be assigned to the offensive language category. Accordingly, an advertiser or any other suitable user of the rating application can customize the score by assigning weights to one or more categories. That is, a multi-dimensional rating vector can be created that represents, for each site, the distribution of risk of adjacency to objectionable content along different dimensions: guns, bombs and ammunition; alcohol; offensive language; hate speech; tobacco; spyware and malicious code; illegal drugs; adult content; gaming and gambling; entertainment; illegality; and/or obscenity.
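The weighted combination of per-category ratings can be sketched as follows (the categories, ratings, and weights are illustrative assumptions):

```python
def combined_rating(ratings, weights):
    """Weighted mean of per-category ratings on the common 0-1000 scale."""
    z = sum(weights.values())
    return sum(ratings[cat] * w for cat, w in weights.items()) / z

ratings = {"adult": 650.0, "offensive_language": 900.0}   # per-category ratings
weights = {"adult": 0.7, "offensive_language": 0.3}       # advertiser-assigned weights
score = combined_rating(ratings, weights)   # (0.7*650 + 0.3*900) / 1.0 = 725
```

Keeping the full dictionary of per-category ratings alongside the combined score preserves the multi-dimensional rating vector described in the text.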
  • Additionally or alternatively to generating a rating for a website or a webpage, the rating application can determine a rating for a sitelet. As used herein, a sitelet is a collection or subset of web pages and, more particularly, is often a topically homogeneous portion of a page, such as a topic-oriented subtree of a large site's hierarchical tree structure. For example, “finance.yahoo.com” can receive a rating as a sitelet of the website “yahoo.com.”
  • It should be noted that the rating application can rate sitelets because there are web pages that the rating application has never seen before. However, that does not mean that the rating application has no evidence with which to rate such a page: there is substantial rating locality within sitelets, and a page from a risky site or sitelet is likely to be risky itself. In addition, the rating application can rate sitelets for computational storage efficiency as it may not be necessary to save or store the scores for individual pages if they are not significantly different from the scores for the sitelet. For example, if the ratings for the individual pages that make up website www.foo.com are within a given threshold value (e.g., a 5% difference), the rating application can store a rating for a sitelet (a collection of those individual pages). It should also be noted that sitelet scores can provide additional evidence to the rating computation even when the page has been seen before.
  • It should further be noted that advertising on a website can be an indication of direct financial support of the website. Even if a particular page does not contain objectionable content or is determined to not likely contain objectionable content, an advertiser may not want to support a site that otherwise promotes objectionable categories of content. For example, the rating application can provide an indication when a particular news item promotes or supports a major Nazi website. In another example, aside from the content of a page, the rating application can provide an indication when a particular advertiser that supports or advertises on a particular website falls in an objectionable category. In a more particular example, the rating application can detect whether the content falls within an objectionable category and whether advertisers, promoters, or other entities associated with the content fall within an objectionable category.
  • In the example where sitelets are subtrees in the hierarchical site structure, FIG. 8 shows an illustrative example in which incoming URLs are matched to the sitelet with the longest available shared prefix. The aggregated ordinomials and the associated rating of this longest prefix are then used for the query URL. Radix trees can, in some embodiments, be used to make this query computationally efficient.
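A minimal sketch of the longest-shared-prefix lookup follows. A production system would use a radix tree as noted above; the linear scan here is for clarity only, and the URLs and stored ratings are hypothetical.

```python
# Sketch of longest-prefix sitelet lookup (linear scan for clarity;
# a radix tree would make this efficient). URLs/ratings are hypothetical.

def sitelet_rating(url, stored):
    """Return the rating of the stored sitelet sharing the longest prefix."""
    matches = [p for p in stored if url.startswith(p)]
    if not matches:
        return None
    return stored[max(matches, key=len)]

stored = {
    "example.com/": 0.2,
    "example.com/forums/": 0.7,
}

r = sitelet_rating("example.com/forums/thread/123", stored)  # 0.7
```

A query URL under an unseen page still inherits the rating of its closest stored ancestor, which is exactly the rating-locality argument made above.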
  • It should be noted that a rating for every URL or sub-string in the file tree implied by a domain's URLs need not be stored explicitly. If the rating for a page or sub-tree is not significantly different from that of its parent, then explicit storage offers little additional benefit at the expense of increased storage and computation. Given a sensitivity parameter or threshold, τ, that expresses the trade-off between sensitivity and storage, the rating application can store ratings for those components of the subtree with:

  • |R(p) − R(c)| ≧ τ
  • where R(•) denotes the rating for an entity, c denotes the child page or subtree whose rating is under consideration, and p denotes the parent of child page c.
  • Similar to individual pages, sitelet ratings can be generated from sitelet ordinomials. The sitelet ordinomials can be produced by an aggregation process over the pages in the sitelet. For example, the sitelet ordinomial can be a weighted combination of the page ordinomials, a Bayesian combination, or generated using any suitable explicit mathematical function.
  • FIG. 9 shows an illustrative example of calculating sitelet ordinomials in accordance with some embodiments of the disclosed subject matter. As shown, for calculating the aggregated sitelet ordinomial, the pages in the sitelet can be considered as one large set, or the tree structure can be taken into account explicitly. In the latter case, the calculation can be done efficiently by recursion. The base step is to calculate the rating at the root node. Then, at each step, the ratings for all of the children are calculated and, for each child, the inequality |R(p) − R(c)| ≧ τ is evaluated. It should be noted that, in this embodiment, p represents the most closely neighboring super-node in the file tree to c that has been isolated as a sitelet and given an explicitly stored rating. Children for which this inequality holds are stored explicitly as their own sitelets and subjected to further recursion.
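The recursion just described might be sketched as follows. The tree layout, rating values, and threshold τ are hypothetical, and single numbers stand in for the aggregated ordinomials a real implementation would carry.

```python
# Sketch of recursive sitelet isolation: store a child subtree as its own
# sitelet only when its rating differs from its nearest stored ancestor
# by at least tau. Tree, ratings, and tau are hypothetical.

def isolate_sitelets(node, parent_rating, tau, stored):
    rating = node["rating"]
    if parent_rating is None or abs(rating - parent_rating) >= tau:
        stored[node["path"]] = rating
        parent_rating = rating  # now the nearest stored ancestor for children
    for child in node.get("children", []):
        isolate_sitelets(child, parent_rating, tau, stored)
    return stored

tree = {
    "path": "example.com/", "rating": 0.2, "children": [
        {"path": "example.com/news/", "rating": 0.25, "children": []},
        {"path": "example.com/forums/", "rating": 0.8, "children": []},
    ],
}

stored = isolate_sitelets(tree, None, 0.3, {})
# only the root and the forums subtree get explicit ratings;
# news/ inherits the root rating via prefix matching
```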
  • In some embodiments, sitelet ordinomials can be efficiently calculated using a map reduction process in accordance with some embodiments of the disclosed subject matter. For example, as shown in FIG. 10, in settings with small domains (e.g., those where processing can occur comfortably in a single reducer machine), the rating application can generate ratings using a single pass via MapReduce or any other suitable mapping approach. The reduction phase is performed using the domain as a key. Once the URLs belonging to a domain are assembled together, a file tree or domain tree can be generated, and the above-mentioned calculation of sitelet ordinomials can be used to find pertinent ratings in a domain.
  • Alternatively, FIG. 11 shows that sitelet ordinomials can be efficiently calculated using a map reduction process for settings with larger domains. In such cases, the reduction via MapReduce can occur iteratively. Let M denote the number of suffixes in the largest domain. Then, for t from M down to 1, all ordinomials at level t are combined according to the inequality |R(p) − R(c)| ≧ τ. In cases where the inequality returns false, all children are stored for rating and sitelet computation at higher levels. This may lead to an unacceptable demand for memory and resources. To alleviate this demand, children with the same or very similar ratings can be combined using explicit combination functions, for example, Bayesian or weighted averaging. Those children that have a difference in rating of at least τ are stored explicitly as their own sitelet rating. Each step reduces t by one: t←t−1. This is repeated until t reaches 1, where the rating and sitelet are calculated and stored to ensure that all URLs present in a domain receive some rating.
  • In some embodiments, the rating application can calculate ratings using both temporal aggregation and sitelet aggregation. Generally speaking, the rating application accomplishes this by performing the temporal aggregation on URLs at the first step of sitelet aggregation. For example, as shown in FIG. 12, the rating application can aggregate posterior ordinomials over all times (t), perform a reduction phase using the domain as a key, and, once the URLs belonging to a domain are assembled together, generate a file tree or domain tree. The expected ordinomial for each URL can then be calculated.
  • In accordance with some embodiments of the disclosed subject matter, mechanisms are provided for evaluating the quality of collections of online media and other suitable content. Because online media is often purchased by advertisers at different levels of granularity (e.g., ranging from individual pages to large sets of domains), it is desirable to develop metrics for comparing the quality of such diverse sets of content. More particularly, these mechanisms, among other things, collect individual content ratings, aggregate these ratings across arbitrary subsets, normalize these ratings to be on a general index scale, and calibrate the normalized ratings such that the global mean provides a benchmark for comparison.
  • Generally speaking, the application calculates several metrics for particular content (e.g., media, web pages, etc.). For example, in the case of objectionable content, a category can have metrics encapsulating the risk related to the appearance of adult content, metrics encapsulating the risk related to the appearance or use of hate speech, etc. Accordingly, in some embodiments, the application can provide a single metric encapsulating the different aspects of the content.
  • For example, let x_j refer to an individual example of a piece of online media or online content, such as a particular web page, video, or image. Given multiple risk ratings for the piece of content x_j, e.g., r^(1)(x_j), . . . , r^(M)(x_j) for 1 through M different categories of objectionable content, the multiple risk ratings can be combined into a single concise metric, r(x_j), using, for example, a specialized combination function, h, such that:

  • r(x_j) = h(r^(1)(x_j), . . . , r^(M)(x_j))
  • In a more particular example, combination functions include weighted averaging, where the weights are set according to the importance of particular objectionable content categories; Bayesian mixing; a secondary combining model; and/or a simple minimum function that selects the most risky category, as in a brand safety model. As described above, multiple combining functions can also be used and aggregated to create the single concise metric.
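Two of the combination functions h(…) named above might be sketched as follows. The ratings and weights are hypothetical, and higher values are assumed to mean safer content, so the minimum picks out the riskiest category.

```python
# Hypothetical sketches of two combination functions h(...).
# Ratings/weights are illustrative; higher values assumed safer.

def h_weighted(ratings, weights):
    """Weighted average, weights reflecting category importance."""
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)

def h_min(ratings):
    """Brand-safety style: the riskiest (lowest) category dominates."""
    return min(ratings)

cats = [180.0, 90.0, 150.0]  # e.g., adult, hate speech, offensive language

worst = h_min(cats)                        # 90.0
avg = h_weighted(cats, [2.0, 1.0, 1.0])    # (360 + 90 + 150) / 4 = 150.0
```

The minimum function is deliberately conservative: one severely risky category drives the whole metric, regardless of how safe the others are.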
  • The single concise metric can, for example, be used to compare diverse sets of content. In a more particular example, the application can allow the advertiser to compare the content management by two different advertising networks.
  • Note that, in some embodiments, r(•) can be ordinal, where r(•) ∈ {V_0, . . . , V_d}, such that, without loss of generality, V_0 < V_1 < . . . < V_d. Additionally or alternatively, the ratings r(•) can also be real-valued, where r(•) ∈ ℝ. As used herein, r(x_j) can provide a measure that includes both the quality (or severity) of x_j and the confidence that x_j deserves that level of quality. That is, the rating application can provide a rating r(•) that combines both the likelihood and the severity of the content considered.
  • Generally speaking, online media is often packaged into arbitrary collections when being traded in the online advertising marketplace. Additionally, natural boundaries may exist, segregating a collection of content into distinct subsets. Given a rating defined on individual examples in this content space, r(•), it can be desirable to combine the ratings on individual pages into aggregate ratings denoting the expected rating of an entire subset of content. Let X denote a collection of media, for example, the media holdings of an online publisher, a particular category of web pages (such as pages related to sports), or the pages offered by a supply-side advertising network, including any subsets thereof. The rating application can aggregate the ratings of the content in this collection, x ∈ X.
  • For ordinal ratings, the aggregation of ratings can be expressed as:

  • r_agg = argmax_V Σ_{x∈X} Π(r(x) = V)
  • It should be noted that, in the above-mentioned equation, Π(•) is an indicator function that takes the value 1 when the operand is true, and zero otherwise. This aggregation corresponds to the most common ordinal value in the collection. It should also be noted that ties may be broken arbitrarily, for example, by choosing the most severe category in the tie, for safety.
  • For real-valued ratings, the aggregation of ratings can be expressed as:

  • r_agg = (1/|X|) Σ_{x∈X} r(x)
  • It should be noted that, in the above-mentioned equation, |X| is the number of examples in X. It should also be noted that this aggregation corresponds to the arithmetic mean of the individual content ratings.
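Both aggregation rules might be sketched as follows. The ordinal values and the severity ordering (most severe first) are hypothetical.

```python
# Sketch of ordinal (mode, ties broken toward severity) and real-valued
# (arithmetic mean) aggregation. Values and ordering are hypothetical.
from collections import Counter
from statistics import mean

SEVERITY = ["high", "moderate", "low"]  # assumed: most severe first

def aggregate_ordinal(ratings):
    """Most common ordinal value; ties broken toward the more severe one."""
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [v for v, n in counts.items() if n == top]
    return min(tied, key=SEVERITY.index)

def aggregate_real(ratings):
    """Arithmetic mean of individual real-valued ratings."""
    return mean(ratings)

agg_o = aggregate_ordinal(["low", "low", "moderate", "high"])  # "low"
agg_r = aggregate_real([0.2, 0.4, 0.6])                        # 0.4
```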
  • When aggregating content ratings, the rating application considers that content may be presented in a pre-aggregated form. For example, the input may be domains, each with an aggregate rating. Formally, let Y_l be a collection of one or more examples of content, x ∈ Y_l. Let X then be extended to be a collection of such collections, Y_l ∈ X. Rating aggregation can then be extended to such sub-aggregations of content.
  • For ordinal ratings, the aggregation of ratings can be expressed as:

  • r_agg = argmax_V Σ_{Y_l∈X} |Y_l| Π(r_agg(Y_l) = V)
  • where |Y_l| is the number of examples in Y_l.
  • For real-valued ratings, the aggregation of ratings can be expressed as:

  • r_agg = (1/|X|) Σ_{Y_l∈X} |Y_l| r_agg(Y_l)
  • where |X| is extended to be Σ_{Y_l∈X} Σ_{x∈Y_l} Π(x ∈ Y_l), the count of all examples in all subsets.
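The real-valued sub-collection aggregation might be sketched as a size-weighted mean. The sub-collections are given as hypothetical (example count, aggregate rating) pairs.

```python
# Sketch of real-valued aggregation over pre-aggregated sub-collections,
# given as hypothetical (example_count, aggregate_rating) pairs.

def aggregate_subsets(subsets):
    """Size-weighted mean: sum(|Y_l| * r_agg(Y_l)) / total example count."""
    total = sum(n for n, _ in subsets)
    return sum(n * r for n, r in subsets) / total

# a small publisher (2 pages, rating 1.0) and a large one (8 pages, 0.5)
agg = aggregate_subsets([(2, 1.0), (8, 0.5)])  # (2*1.0 + 8*0.5) / 10 = 0.6
```

Because the weighting is by example count, the recursion noted below falls out naturally: a collection of collections is aggregated the same way, with each inner aggregate carrying its own total count.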
  • It should be noted that the aggregate ratings on pre-collected online media are recursive. That is, content ratings can be aggregated on collections of collections of collections, etc.
  • In some embodiments, the rating application takes unconstrained, real-valued ratings and projects them onto a bounded region of the number line for ease of comparison. The number value assigned to each ordinal category in this mapping can be constructed to capture the risk and severity profiles of content in each respective category.
  • For example, the rating application can be configured to define an index-scaled rating to be a numerical rating assigned to online media constrained to the range r_i(x) ∈ [α,β]. This rating is assumed to capture both the severity and risk of appearance of online media, with r_i(x_j) ≧ r_i(x_k) implying that x_k is at least as risky as x_j; that is, there is a greater chance of riskier content appearing on x_k than on x_j. This implies that x_j is likely to be safer for brand advertisers or other online media buyers. The values of α and β can be set arbitrarily. For example, α=0 and β=200 can be used for the scale.
  • Given an index-scaled rating, r_i(x), on a particular example x, the rating application defines a mapping from an unscaled rating into the scaled rating r_i(x) for both ordinal and real-valued ratings.
  • For ordinal ratings, the index-scaled rating can be expressed as:

  • r_i(x_j) = a_{r(x_j)}
  • It should be noted that the mapping to an index-scaled rating is performed by assigning a constant, a, to each ordinal non-index-scaled rating. As mentioned above, in the ordinal setting, r(•) ∈ {V_0, . . . , V_d}. Here, each a ∈ [α,β] (i.e., a is bounded by the index-scaled rating range) and, without loss of generality, a_{V_m} < a_{V_n} whenever V_m < V_n, so that more risky ordinal categories have lower numerical values in the mapping.
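The ordinal mapping reduces to a lookup table of constants a_V. The three categories and the constants below are hypothetical, chosen within the example [0, 200] scale with riskier categories receiving lower values.

```python
# Hypothetical constants a_V on [alpha, beta] = [0, 200];
# riskier ordinal categories get lower constants.
A = {"high_risk": 50.0, "moderate_risk": 120.0, "low_risk": 200.0}

def index_scaled(ordinal_rating):
    """r_i(x) = a_{r(x)}: look up the constant for the ordinal category."""
    return A[ordinal_rating]

ri = index_scaled("moderate_risk")  # 120.0
```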
  • For real-valued ratings, the index-scaled rating can be expressed as:

  • r_i(x_j) = f(r(x_j))
  • Here, f(•) is a monotonic function: f(r(x_j)) ≦ f(r(x_k)) whenever r(x_j) ≦ r(x_k). That is, lower unscaled ratings tend to get lower scaled ratings. Additionally, it should be noted that the range of f(•) is [α,β].
  • Accordingly, the rating application can transform arbitrary raw ratings into an index-scaled rating. Such a numerical rating can encode the likelihood of encountering risky or inappropriate content on a given example of online media, in addition to the likely severity of such content. The resulting index-scaled rating represents the value of online content to buyers and advertisers, with risky and severely inappropriate content generally being of low value.
  • Because a single example of an advertisement appearing on severely inappropriate content (e.g., pornography or hate speech) may have harsh consequences for the advertiser placing the advertisement, individual examples of inappropriate content may have a disproportionate influence in the aggregate rating of collected content. That is, even a few severely inappropriate pages in a site containing thousands of examples may bring the aggregate rating down significantly. In order to capture this disproportionate influence of inappropriate content, the rating application can be configured to aggregate ratings for collected content, x ∈ X, with commensurate impact of riskier individual pages. This can be represented as follows:
  • r_i,agg = (Σ_{x∈X} r_i(x)) / (Σ_{x∈X} w(r_i(x)))
  • It should be noted that w(•)ε[1,∞) represents a weight function associated with a content rating. More particularly, content that is riskier receives both a lower numerical rating and contributes to a higher total weight, thereby lowering the expected score via a lower denominator. For example, assume that the rating application creates four risk buckets—e.g., very high risk, high risk, moderate risk, and low risk, each with ranges of an index-scaled rating. For a given aggregation of content, the rating application also denotes the number of examples in each by r1, r2, r3, and r4, respectively. The rating application can also assign a native index-scaled rating to each bucket. For example, the rating application can assign 50, 100, 150, and 200 to each bucket, respectively. In addition, the rating application can provide combination weights for each category. For example, the application can assign the combination weights of 35.2, 8.8, 2.2, and 1.0 for each bucket, respectively. Accordingly, a severity weight aggregation of such content can be determined by calculating:
  • r_i,agg = (50r_1 + 100r_2 + 150r_3 + 200r_4) / (35.2r_1 + 8.8r_2 + 2.2r_3 + 1.0r_4)
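The severity-weighted aggregation above might be sketched as follows. The bucket values (50…200) and combination weights (35.2…1.0) follow the example in the text; the bucket counts r_1…r_4 are hypothetical.

```python
# Severity-weighted aggregation using the bucket values and combination
# weights from the example above; bucket counts are hypothetical.

def severity_weighted(counts, values=(50, 100, 150, 200),
                      weights=(35.2, 8.8, 2.2, 1.0)):
    num = sum(v * n for v, n in zip(values, counts))
    den = sum(w * n for w, n in zip(weights, counts))
    return num / den

safe = severity_weighted((0, 0, 0, 10))   # 2000 / 10 = 200.0
mixed = severity_weighted((1, 0, 0, 9))   # 1850 / 44.2, about 41.9
```

Note how a single very-high-risk page among ten drags the aggregate from 200 down to roughly 42: the high combination weight in the denominator gives risky pages their disproportionate influence.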
  • In some embodiments, the rating application not only considers how content rates with respect to risk and severity, but also determines how that content compares to other similar content. In order to perform such a comparison, the rating application recalibrates ratings to the mean rating of the content being considered. The mean (μ_r) of the uncalibrated set of ratings can be determined by calculating:
  • μ_r = (1/|X|) Σ_{x∈X} r_i(x)
  • It should be noted that gamma (γ) denotes the value to which the mean is mapped after calibration and Y_j denotes a subset of content in X. The rating application then defines a calibration of Y_j's rating, r_c, relative to μ_r using the following cases:
  • r_c(Y_j) =
      γ                                    if r_i,agg(Y_j) = μ_r
      γ(r_i,agg(Y_j) − α)/μ_r + α          if r_i,agg(Y_j) < μ_r
      γ(r_i,agg(Y_j) − γ)/μ_r + γ          if r_i,agg(Y_j) > μ_r
  • For example, consider the above-mentioned case where α=0 and β=200, and let γ=100. The re-calibration can be performed by determining:
  • r_c(Y_j) =
      100                                     if r_i,agg(Y_j) = μ_r
      100 r_i,agg(Y_j)/μ_r                    if r_i,agg(Y_j) < μ_r
      100(r_i,agg(Y_j) − 100)/μ_r + 100       if r_i,agg(Y_j) > μ_r
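One reading of the α=0, γ=100 calibration might be sketched as follows; the extraction of the original piecewise formula is ambiguous, so this is a hedged interpretation, and the aggregate ratings and mean used below are hypothetical.

```python
# Sketch of one reading of the alpha=0, gamma=100 calibration above.
# Aggregate ratings and the mean mu are hypothetical.

def calibrate(r_agg, mu, gamma=100.0):
    """Map aggregate index-scaled ratings so the mean lands at gamma."""
    if r_agg == mu:
        return gamma
    if r_agg < mu:
        return gamma * r_agg / mu
    return gamma * (r_agg - gamma) / mu + gamma

mu = 100.0
at_mean = calibrate(100.0, mu)   # 100.0: the mean maps to gamma
below = calibrate(50.0, mu)      # 50.0: below-average content scores below gamma
above = calibrate(150.0, mu)     # 150.0: above-average content scores above gamma
```

With the mean already at γ the mapping is the identity; when the mean differs from γ, content is stretched or compressed around it so that "average for this marketplace" always reads as 100.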
  • FIG. 13 is a generalized schematic diagram of a system 1300 on which the rating application may be implemented in accordance with some embodiments of the disclosed subject matter. As illustrated, system 1300 may include one or more user computers 1302. User computers 1302 may be local to each other or remote from each other. User computers 1302 are connected by one or more communications links 1304 to a communications network 1306 that is linked via a communications link 1308 to a server 1310.
  • System 1300 may include one or more servers 1310. Server 1310 may be any suitable server for providing access to the application, such as a processor, a computer, a data processing device, or a combination of such devices. For example, the application can be distributed into multiple backend components and multiple frontend components or interfaces. In a more particular example, backend components, such as data collection and data distribution can be performed on one or more servers 1310. Similarly, the graphical user interfaces displayed by the application, such as a data interface and an advertising network interface, can be distributed by one or more servers 1310 to user computer 1302.
  • More particularly, for example, each of the client 1302 and server 1310 can be any of a general purpose device such as a computer or a special purpose device such as a client, a server, etc. Any of these general or special purpose devices can include any suitable components such as a processor (which can be a microprocessor, digital signal processor, a controller, etc.), memory, communication interfaces, display controllers, input devices, etc. For example, client 1302 can be implemented as a personal computer, a personal data assistant (PDA), a portable email device, a multimedia terminal, a mobile telephone, a set-top box, a television, etc.
  • In some embodiments, any suitable computer readable media can be used for storing instructions for performing the processes described herein, can be used as a content distribution that stores content and a payload, etc. For example, in some embodiments, computer readable media can be transitory or non-transitory. For example, non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as flash memory, electrically programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), etc.), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
  • Referring back to FIG. 13, communications network 1306 may be any suitable computer network including the Internet, an intranet, a wide-area network (“WAN”), a local-area network (“LAN”), a wireless network, a digital subscriber line (“DSL”) network, a frame relay network, an asynchronous transfer mode (“ATM”) network, a virtual private network (“VPN”), or any combination of any of such networks. Communications links 1304 and 1308 may be any communications links suitable for communicating data between user computers 1302 and server 1310, such as network links, dial-up links, wireless links, hard-wired links, any other suitable communications links, or a combination of such links. User computers 1302 enable a user to access features of the application. User computers 1302 may be personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (“PDAs”), two-way pagers, wireless terminals, portable telephones, any other suitable access device, or any combination of such devices. User computers 1302 and server 1310 may be located at any suitable location. In one embodiment, user computers 1302 and server 1310 may be located within an organization. Alternatively, user computers 1302 and server 1310 may be distributed between multiple organizations.
  • Referring back to FIG. 13, the server and one of the user computers depicted in FIG. 13 are illustrated in more detail in FIG. 14. Referring to FIG. 14, user computer 1302 may include processor 1402, display 1404, input device 1406, and memory 1408, which may be interconnected. In a preferred embodiment, memory 1408 contains a storage device for storing a computer program for controlling processor 1402.
  • Processor 1402 uses the computer program to present on display 1404 the application and the data received through communications link 1304 and commands and values transmitted by a user of user computer 1302. It should also be noted that data received through communications link 1304 or any other communications links may be received from any suitable source. Input device 1406 may be a computer keyboard, a cursor-controller, dial, switchbank, lever, or any other suitable input device as would be used by a designer of input systems or process control systems.
  • Server 1310 may include processor 1420, display 1422, input device 1424, and memory 1426, which may be interconnected. In a preferred embodiment, memory 1426 contains a storage device for storing data received through communications link 1308 or through other links, and also receives commands and values transmitted by one or more users. The storage device further contains a server program for controlling processor 1420.
  • In some embodiments, the application may include an application program interface (not shown), or alternatively, the application may be resident in the memory of user computer 1302 or server 1310. In another suitable embodiment, the only distribution to user computer 1302 may be a graphical user interface (“GUI”) which allows a user to interact with the application resident at, for example, server 1310.
  • In one particular embodiment, the application may include client-side software, hardware, or both. For example, the application may encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as HyperText Markup Language (“HTML”), Dynamic HyperText Markup Language (“DHTML”), Extensible Markup Language (“XML”), JavaServer Pages (“JSP”), Active Server Pages (“ASP”), Cold Fusion, or any other suitable approaches).
  • Although the application is described herein as being implemented on a user computer and/or server, this is only illustrative. The application may be implemented on any suitable platform (e.g., a personal computer (“PC”), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, an H/PC, an automobile PC, a laptop computer, a cellular phone, a personal digital assistant (“PDA”), a combined cellular phone and PDA, etc.) to provide such features.
  • It will also be understood that the detailed description herein may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art.
  • A procedure is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. These steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
  • Further, the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein which form part of the present invention; the operations are machine operations. Useful machines for performing the operation of the present invention include general purpose digital computers or similar devices.
  • The present invention also relates to apparatus for performing these operations. This apparatus may be specially constructed for the required purpose or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer. The procedures presented herein are not inherently related to a particular computer or other apparatus. Various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove more convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given.
  • Accordingly, methods, systems, and media for applying scores and ratings to web pages, web sites, and other pieces of content of interest to advertisers or content providers for safe and effective online advertising are provided.
  • It is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
  • Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. Features of the disclosed embodiments can be combined and rearranged in various ways.

Claims (21)

1. A method for rating webpages for safe publication, the method comprising:
extracting features from a web page;
applying, using a hardware processor, a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web page is a member of one of a plurality of severity groups;
generating a rating based on the plurality of ordinomial estimates; and
determining whether a content item should be published on the web page based on the rating.
2. The method of claim 1, wherein the content item is advertisement content.
3. The method of claim 1, further comprising determining a posterior ordinomial estimate for the web page by combining the plurality of ordinomial estimates.
4. The method of claim 3, further comprising:
determining a plurality of posterior ordinomial estimates at a plurality of times for the web page; and
determining an expected posterior ordinomial estimate by combining the plurality of posterior ordinomial estimates over the plurality of times.
5. The method of claim 3, further comprising:
determining that the web page belongs to a sitelet, wherein the sitelet includes at least one of a plurality of web pages;
determining a sitelet ordinomial by aggregating the plurality of posterior ordinomial estimates associated with the one or more web pages belonging to the sitelet; and
generating a sitelet rating based on the aggregated plurality of posterior ordinomials.
6. The method of claim 1, further comprising:
collecting a plurality of ratings associated with a plurality of web pages, wherein the plurality of ratings includes ordinal ratings and real-valued ratings;
determining an aggregate rating for the plurality of web pages based on the collected plurality of ratings;
normalizing the aggregate rating by mapping the aggregate rating to an index-scaled rating;
applying a severity weight to the index-scaled rating; and
generating a severity-weighted index-scaled rating for the plurality of web pages.
7. The method of claim 1, further comprising generating a combined rating by combining the generated rating that encodes whether the web page is likely to contain content of a first category with a second rating that encodes whether the web page is likely to contain content of a second category.
8. A system for rating webpages for safe publication, the system comprising:
a hardware processor that:
extracts features from a web page;
applies a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web page is a member of one of a plurality of severity groups;
generates a rating based on the plurality of ordinomial estimates; and
determines whether a content item should be published on the web page based on the rating.
9. The system of claim 8, wherein the content item is advertisement content.
10. The system of claim 8, wherein the hardware processor is further configured to determine a posterior ordinomial estimate for the web page by combining the plurality of ordinomial estimates.
11. The system of claim 10, wherein the hardware processor is further configured to:
determine a plurality of posterior ordinomial estimates at a plurality of times for the web page; and
determine an expected posterior ordinomial estimate by combining the plurality of posterior ordinomial estimates over the plurality of times.
12. The system of claim 10, wherein the hardware processor is further configured to:
determine that the web page belongs to a sitelet, wherein the sitelet includes at least one of a plurality of web pages;
determine a sitelet ordinomial by aggregating the plurality of posterior ordinomial estimates associated with the one or more web pages belonging to the sitelet; and
generate a sitelet rating based on the aggregated plurality of posterior ordinomials.
13. The system of claim 8, wherein the hardware processor is further configured to:
collect a plurality of ratings associated with a plurality of web pages, wherein the plurality of ratings includes ordinal ratings and real-valued ratings;
determine an aggregate rating for the plurality of web pages based on the collected plurality of ratings;
normalize the aggregate rating by mapping the aggregate rating to an index-scaled rating;
apply a severity weight to the index-scaled rating; and
generate a severity-weighted index-scaled rating for the plurality of web pages.
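Claim 13's normalization chain (aggregate, map to an index scale, apply a severity weight) can be sketched as below. The mean aggregation, min-max index mapping, weight value, and clamping are all assumptions made for illustration.

```python
def severity_weighted_index(ratings, severity_weight=1.5, index_max=100.0):
    # ratings may mix ordinal levels and real values; coerce to float,
    # aggregate by mean, map onto a 0..index_max index scale, then apply
    # a severity weight (clamped so the index stays in range).
    vals = [float(r) for r in ratings]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1.0
    aggregate = sum(vals) / len(vals)
    index_scaled = (aggregate - lo) / span * index_max
    return min(index_max, index_scaled * severity_weight)
```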
14. The system of claim 8, wherein the hardware processor is further configured to generate a combined rating by combining the generated rating that encodes whether the web page is likely to contain content of a first category with a second rating that encodes whether the web page is likely to contain content of a second category.
15. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for rating webpages for safe publication, the method comprising:
extracting features from a web page;
applying a plurality of statistical models to the extracted features to generate a plurality of ordinomial estimates, wherein each ordinomial estimate represents a probability that the web page is a member of one of a plurality of severity groups;
generating a rating based on the plurality of ordinomial estimates; and
determining whether a content item should be published on the web page based on the rating.
16. The non-transitory computer-readable medium of claim 15, wherein the content item is advertisement content.
17. The non-transitory computer-readable medium of claim 15, wherein the method further comprises determining a posterior ordinomial estimate for the web page by combining the plurality of ordinomial estimates.
18. The non-transitory computer-readable medium of claim 17, wherein the method further comprises:
determining a plurality of posterior ordinomial estimates at a plurality of times for the web page; and
determining an expected posterior ordinomial estimate by combining the plurality of posterior ordinomial estimates over the plurality of times.
19. The non-transitory computer-readable medium of claim 17, wherein the method further comprises:
determining that the web page belongs to a sitelet, wherein the sitelet includes at least one of a plurality of web pages;
determining a sitelet ordinomial by aggregating the plurality of posterior ordinomial estimates associated with the one or more web pages belonging to the sitelet; and
generating a sitelet rating based on the aggregated plurality of posterior ordinomials.
20. The non-transitory computer-readable medium of claim 15, wherein the method further comprises:
collecting a plurality of ratings associated with a plurality of web pages, wherein the plurality of ratings includes ordinal ratings and real-valued ratings;
determining an aggregate rating for the plurality of web pages based on the collected plurality of ratings;
normalizing the aggregate rating by mapping the aggregate rating to an index-scaled rating;
applying a severity weight to the index-scaled rating; and
generating a severity-weighted index-scaled rating for the plurality of web pages.
21. The non-transitory computer-readable medium of claim 15, wherein the method further comprises generating a combined rating by combining the generated rating that encodes whether the web page is likely to contain content of a first category with a second rating that encodes whether the web page is likely to contain content of a second category.
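The per-category combination recited in claims 7, 14, and 21 can be sketched as below. Treating each rating as the probability that the page contains that category's content, and assuming independence between categories (an illustrative simplification not stated in the claims), the combined rating is the probability of containing either category.

```python
def combined_rating(rating_first_category, rating_second_category):
    # Each input is the probability that the page contains content of
    # that category. The page is clean only if it avoids both categories,
    # so under independence the combined rating is the complement of the
    # joint "clean" probability.
    p_clean = (1.0 - rating_first_category) * (1.0 - rating_second_category)
    return 1.0 - p_clean
```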
US14/184,264 2010-06-01 2014-02-19 Methods, systems, and media for applying scores and ratings to web pages,web sites, and content for safe and effective online advertising Pending US20140379443A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/184,264 US20140379443A1 (en) 2010-06-01 2014-02-19 Methods, systems, and media for applying scores and ratings to web pages,web sites, and content for safe and effective online advertising

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US35039310P 2010-06-01 2010-06-01
US12/859,763 US20110047006A1 (en) 2009-08-21 2010-08-19 Systems, methods, and media for rating websites for safe advertising
US201161431789P 2011-01-11 2011-01-11
US13/151,146 US8732017B2 (en) 2010-06-01 2011-06-01 Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising
US14/184,264 US20140379443A1 (en) 2010-06-01 2014-02-19 Methods, systems, and media for applying scores and ratings to web pages,web sites, and content for safe and effective online advertising

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/151,146 Continuation US8732017B2 (en) 2010-06-01 2011-06-01 Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising

Publications (1)

Publication Number Publication Date
US20140379443A1 (en) 2014-12-25

Family

ID=45439233

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/151,146 Active 2032-03-17 US8732017B2 (en) 2010-06-01 2011-06-01 Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising
US14/184,264 Pending US20140379443A1 (en) 2010-06-01 2014-02-19 Methods, systems, and media for applying scores and ratings to web pages,web sites, and content for safe and effective online advertising

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/151,146 Active 2032-03-17 US8732017B2 (en) 2010-06-01 2011-06-01 Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising

Country Status (1)

Country Link
US (2) US8732017B2 (en)


Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9990674B1 (en) 2007-12-14 2018-06-05 Consumerinfo.Com, Inc. Card registry systems and methods
US8312033B1 (en) 2008-06-26 2012-11-13 Experian Marketing Solutions, Inc. Systems and methods for providing an integrated identifier
US8060424B2 (en) 2008-11-05 2011-11-15 Consumerinfo.Com, Inc. On-line method and system for monitoring and reporting unused available credit
US9195990B2 (en) 2010-06-02 2015-11-24 Integral Ad Science, Inc. Methods, systems, and media for reviewing content traffic
US9483606B1 (en) 2011-07-08 2016-11-01 Consumerinfo.Com, Inc. Lifescore
US9311599B1 (en) 2011-07-08 2016-04-12 Integral Ad Science, Inc. Methods, systems, and media for identifying errors in predictive models using annotators
US9223888B2 (en) * 2011-09-08 2015-12-29 Bryce Hutchings Combining client and server classifiers to achieve better accuracy and performance results in web page classification
US9106691B1 (en) 2011-09-16 2015-08-11 Consumerinfo.Com, Inc. Systems and methods of identity protection and management
US8738516B1 (en) 2011-10-13 2014-05-27 Consumerinfo.Com, Inc. Debt services candidate locator
US9014717B1 (en) * 2012-04-16 2015-04-21 Foster J. Provost Methods, systems, and media for determining location information from real-time bid requests
US9853959B1 (en) 2012-05-07 2017-12-26 Consumerinfo.Com, Inc. Storage and maintenance of personal data
US10387911B1 (en) 2012-06-01 2019-08-20 Integral Ad Science, Inc. Systems, methods, and media for detecting suspicious activity
US9552590B2 (en) 2012-10-01 2017-01-24 Dstillery, Inc. Systems, methods, and media for mobile advertising conversion attribution
US9654541B1 (en) 2012-11-12 2017-05-16 Consumerinfo.Com, Inc. Aggregating user web browsing data
US9916621B1 (en) 2012-11-30 2018-03-13 Consumerinfo.Com, Inc. Presentation of credit score factors
US11068931B1 (en) 2012-12-10 2021-07-20 Integral Ad Science, Inc. Systems, methods, and media for detecting content viewability
US10102570B1 (en) 2013-03-14 2018-10-16 Consumerinfo.Com, Inc. Account vulnerability alerts
US9406085B1 (en) 2013-03-14 2016-08-02 Consumerinfo.Com, Inc. System and methods for credit dispute processing, resolution, and reporting
US10497030B1 (en) 2013-03-15 2019-12-03 Integral Ad Science, Inc. Methods, systems, and media for enhancing a blind URL escrow with real time bidding exchanges
US10482477B2 (en) * 2013-03-15 2019-11-19 Netflix, Inc. Stratified sampling applied to A/B tests
US10685398B1 (en) 2013-04-23 2020-06-16 Consumerinfo.Com, Inc. Presenting credit score information
CN104123328A (en) * 2013-04-28 2014-10-29 北京千橡网景科技发展有限公司 Method and device used for inhibiting spam comments in website
US9477737B1 (en) 2013-11-20 2016-10-25 Consumerinfo.Com, Inc. Systems and user interfaces for dynamic access of multiple remote databases and synchronization of data based on user rules
US10963470B2 (en) 2017-09-06 2021-03-30 Siteimprove A/S Website scoring system
CN108313374A (en) * 2017-12-28 2018-07-24 芜湖瑞思机器人有限公司 A kind of Aloe Vera Gel boxing device
CN108284981A (en) * 2017-12-28 2018-07-17 芜湖瑞思机器人有限公司 A kind of Aloe Vera Gel mounted box method
CN108298137A (en) * 2017-12-28 2018-07-20 芜湖瑞思机器人有限公司 A kind of Aloe Vera Gel boxing device front delivery mechanism
CN108298138A (en) * 2017-12-28 2018-07-20 芜湖瑞思机器人有限公司 A kind of Aloe Vera Gel boxing device send case structure
US20200074541A1 (en) 2018-09-05 2020-03-05 Consumerinfo.Com, Inc. Generation of data structures based on categories of matched data items
US11315179B1 (en) 2018-11-16 2022-04-26 Consumerinfo.Com, Inc. Methods and apparatuses for customized card recommendations
US11238656B1 (en) 2019-02-22 2022-02-01 Consumerinfo.Com, Inc. System and method for an augmented reality experience via an artificial intelligence bot
US11544653B2 (en) * 2019-06-24 2023-01-03 Overstock.Com, Inc. System and method for improving product catalog representations based on product catalog adherence scores
US11941065B1 (en) 2019-09-13 2024-03-26 Experian Information Solutions, Inc. Single identifier platform for storing entity data
US11055208B1 (en) 2020-01-07 2021-07-06 Allstate Insurance Company Systems and methods for automatically assessing and conforming software development modules to accessibility guidelines in real-time
US11836439B2 (en) 2021-11-10 2023-12-05 Siteimprove A/S Website plugin and framework for content management services
US11397789B1 (en) 2021-11-10 2022-07-26 Siteimprove A/S Normalizing uniform resource locators
US11461430B1 (en) 2021-11-10 2022-10-04 Siteimprove A/S Systems and methods for diagnosing quality issues in websites
US11461429B1 (en) 2021-11-10 2022-10-04 Siteimprove A/S Systems and methods for website segmentation and quality analysis
US11687613B2 (en) 2021-11-12 2023-06-27 Siteimprove A/S Generating lossless static object models of dynamic webpages
US11468058B1 (en) 2021-11-12 2022-10-11 Siteimprove A/S Schema aggregating and querying system


Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US6643641B1 (en) 2000-04-27 2003-11-04 Russell Snyder Web search engine with graphic snapshots
US7284008B2 (en) * 2000-08-30 2007-10-16 Kontera Technologies, Inc. Dynamic document context mark-up technique implemented over a computer network
US20030236721A1 (en) 2002-05-21 2003-12-25 Plumer Edward S. Dynamic cost accounting
US7392474B2 (en) * 2004-04-30 2008-06-24 Microsoft Corporation Method and system for classifying display pages using summaries
US7788132B2 (en) * 2005-06-29 2010-08-31 Google, Inc. Reviewing the suitability of Websites for participation in an advertising network
US9286388B2 (en) * 2005-08-04 2016-03-15 Time Warner Cable Enterprises Llc Method and apparatus for context-specific content delivery
US8769673B2 (en) * 2007-02-28 2014-07-01 Microsoft Corporation Identifying potentially offending content using associations

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20060179053A1 (en) * 2005-02-04 2006-08-10 Microsoft Corporation Improving quality of web search results using a game
US8589391B1 (en) * 2005-03-31 2013-11-19 Google Inc. Method and system for generating web site ratings for a user
US20080320010A1 (en) * 2007-05-14 2008-12-25 Microsoft Corporation Sensitive webpage content detection

Non-Patent Citations (1)

Title
How to Write Advertisements that Sell, author unknown, from System, the magazine of Business, dated 1912, downloaded from http://library.duke.edu/digitalcollections/eaa_Q0050/ on 21 February 2015 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
US20180253661A1 (en) * 2017-03-03 2018-09-06 Facebook, Inc. Evaluating content for compliance with a content policy enforced by an online system using a machine learning model determining compliance with another content policy
US11023823B2 (en) * 2017-03-03 2021-06-01 Facebook, Inc. Evaluating content for compliance with a content policy enforced by an online system using a machine learning model determining compliance with another content policy
CN109101502A (en) * 2017-06-20 2018-12-28 阿里巴巴集团控股有限公司 A kind of flow configuration method, switching method and the device of the page

Also Published As

Publication number Publication date
US8732017B2 (en) 2014-05-20
US20120010927A1 (en) 2012-01-12

Similar Documents

Publication Publication Date Title
US8732017B2 (en) Methods, systems, and media for applying scores and ratings to web pages, web sites, and content for safe and effective online advertising
US11868375B2 (en) Method, medium, and system for personalized content delivery
US11061946B2 (en) Systems and methods for cross-media event detection and coreferencing
US20110047006A1 (en) Systems, methods, and media for rating websites for safe advertising
US8812494B2 (en) Predicting content and context performance based on performance history of users
US9223849B1 (en) Generating a reputation score based on user interactions
JP6130609B2 (en) Client-side search templates for online social networks
US9324112B2 (en) Ranking authors in social media systems
US8412648B2 (en) Systems and methods of making content-based demographics predictions for websites
US20130124653A1 (en) Searching, retrieving, and scoring social media
US9311599B1 (en) Methods, systems, and media for identifying errors in predictive models using annotators
US8732015B1 (en) Social media pricing engine
US20120016642A1 (en) Contextual-bandit approach to personalized news article recommendation
US20120102027A1 (en) Compatibility Scoring of Users in a Social Network
US20160239865A1 (en) Method and device for advertisement classification
CN110597962B (en) Search result display method and device, medium and electronic equipment
US9779169B2 (en) System for ranking memes
WO2018130201A1 (en) Method for determining associated account, server and storage medium
CN107526718B (en) Method and device for generating text
US8838435B2 (en) Communication processing
WO2019055654A1 (en) Systems and methods for cross-media event detection and coreferencing
US20160055521A1 (en) Methods, systems, and media for reviewing content traffic
AlMansour et al., A model for recalibrating credibility in different contexts and languages: a Twitter case study
Margaris et al. Improving Collaborative Filtering's Rating Prediction Accuracy by Considering Users' Rating Variability
US20230041339A1 (en) Method, device, and computer program product for user behavior prediction

Legal Events

Date Code Title Description
AS Assignment

Owner name: SILICON VALLEY BANK, CALIFORNIA

Free format text: SECURITY INTEREST;ASSIGNOR:INTEGRAL AD SCIENCE, INC.;REEL/FRAME:043305/0443

Effective date: 20170719

AS Assignment

Owner name: ADSAFE MEDIA, LTD., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ATTENBERG, JOSHUA M.;PROVOST, FOSTER J.;SIGNING DATES FROM 20110831 TO 20110921;REEL/FRAME:045848/0059

AS Assignment

Owner name: INTEGRAL AD SCIENCE, INC., NEW YORK

Free format text: CHANGE OF NAME;ASSIGNOR:ADSAFE MEDIA, LTD.;REEL/FRAME:046245/0517

Effective date: 20121221

AS Assignment

Owner name: GOLDMAN SACHS BDC, INC., AS COLLATERAL AGENT, NEW YORK

Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:INTEGRAL AD SCIENCE, INC.;REEL/FRAME:046594/0001

Effective date: 20180719

Owner name: INTEGRAL AD SCIENCE, INC., NEW YORK

Free format text: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:SILICON VALLEY BANK;REEL/FRAME:046615/0943

Effective date: 20180716

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

AS Assignment

Owner name: INTEGRAL AD SCIENCE, INC., NEW YORK

Free format text: RELEASE OF SECURITY INTEREST IN PATENT COLLATERAL AT REEL/FRAME NO. 46594/0001;ASSIGNOR:GOLDMAN SACHS BDC, INC., AS COLLATERAL AGENT;REEL/FRAME:057673/0706

Effective date: 20210929

Owner name: PNC BANK, NATIONAL ASSOCIATION, AS ADMINISTRATIVE AGENT, PENNSYLVANIA

Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:INTEGRAL AD SCIENCE, INC.;REEL/FRAME:057673/0653

Effective date: 20210929

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION COUNTED, NOT YET MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED