Friday, November 27, 2015

Automatically Tagging and Classifying Patents

Over the past 10+ years, I've worked closely with the IP departments of a number of large companies to automate a simple but highly manual and repetitive task: classifying newly filed or granted USPTO patents into internally organized clusters. One of the major business challenges faced by many IP teams inside large companies is translating new technologies, as represented by technical or engineering publications and patents, into terminology or categories that the rest of the business understands more readily.

The USPTO (US Patent and Trademark Office) and EPO (European Patent Office) each dedicate significant energy to assigning classification codes to every published patent. In the US, this scheme is called the USPC, and in Europe, it's called the ECLA. There are modern attempts to unify the classification systems (the Cooperative Patent Classification, or CPC), and many competing classifications. The problem, however, is that most large patent teams and organizations don't follow these classifications. Instead, they usually adopt new classifications that align more closely with their own technologies. An example might be seen at GoPro, where instead of using the USPTO and EPO classification codes for cameras, digital optics, and semiconductors, they might want to classify their patents by more internally relevant codes, like "Enclosure", "Vibration Resistance", "Water Proofing", "Image Capture", "Image Storage", "Battery System", "Buttons and Interface", and "Software Subsystem".

Why would an organization adopt its own internal tagging scheme instead of using the standard USPC/ECLA/CPC codes? Because these higher-level names are more business friendly. They might be aligned to actual sub-teams within the GoPro organization, rather than following the technological hierarchy that the patent office follows. Large technology companies also rarely (if ever) organize their engineering teams the way patent classification systems are organized. Assigning more organizationally relevant tags to patents therefore also helps direct those patents to the engineering teams most likely to be affected by a newly filed patent.


Suffice it to say, it is often important to read newly filed or granted patents and (re)assign them to the appropriate internally relevant categories. In my experience, 75% of organizations do this work manually, by having in-house patent attorneys read each patent and assign it a tag. Another 15% of companies rely on outside counsel to do this work, and 5% of companies simply rely on the USPTO/EPO coding system. The remaining 5% of companies don't use any internal tagging structure, and instead email patents around one by one to relevant engineers or teams (this works in small teams, but scales badly to large organizations).

Enter the "Patent Autoclassifier":

I've developed a very large dictionary of technical terms (chemical, biological, mechanical, electronic, medical, scientific, etc.) from a variety of sources. I've mined these to create a unique and very high-performing set of stemming and technical thesaurus tools that far outperform the others I've found on the web (which are optimized for marketing or social media semantic analysis rather than technical analysis). I use these tools in a set of scripts that I call the "Autoclassifier". It works like this:
  • A patent analyst, IP specialist, or engineer identifies a set of 20-50 patent categories of importance to the organization
  • In each category, we identify 5-10 patents (or technical documents, engineering white papers, or even proprietary internal documents that have never been published) that exemplify that technology category; these are the "training sets"
  • The training sets are fed through the technical term-optimized filters and semantic tools, including stemmers, parsers, technology-specific part-of-speech taggers, n-tuple identifiers, etc.
  • The cleaned-up data is then fed into my Autoclassifier system, and
  • Each new patent that publishes or grants is then fed, full text, into the Autoclassifier, which automatically assigns the patent 1-3 (on average) tags
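
To make the steps above concrete, here is a minimal Python sketch of the same workflow. It is not the Clojure/Mathematica implementation described in this post: it swaps in a plain TF-IDF centroid classifier with cosine similarity, uses a crude suffix stripper in place of the technical stemmers and thesaurus, and trains on toy one-sentence documents. The function names (`tokenize`, `train`, `classify`), the `min_score` threshold, and the sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny stopword list for illustration only; a real system would use a technical vocabulary.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "that", "the", "to", "with"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real technical stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords, stem what is left."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def vectorize(tokens, idf):
    """L2-normalized TF-IDF vector; terms never seen in training are ignored."""
    weights = {t: n * idf[t] for t, n in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def train(training_sets):
    """training_sets maps category -> list of exemplar texts. Returns (idf, centroids)."""
    docs = [(cat, tokenize(text)) for cat, texts in training_sets.items() for text in texts]
    doc_freq = Counter(term for _, tokens in docs for term in set(tokens))
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    centroids = {}
    for cat, tokens in docs:
        # Sum the exemplar vectors of each category into one centroid.
        centroids.setdefault(cat, Counter()).update(vectorize(tokens, idf))
    return idf, centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, model, max_tags=3, min_score=0.1):
    """Return up to max_tags categories, best first, whose similarity clears min_score."""
    idf, centroids = model
    vec = vectorize(tokenize(text), idf)
    scored = sorted(((cosine(vec, c), cat) for cat, c in centroids.items()), reverse=True)
    return [cat for score, cat in scored[:max_tags] if score >= min_score]


# Toy training sets (invented sentences, two exemplars per category).
training_sets = {
    "Water Proofing": [
        "A sealed housing with a rubber gasket prevents water ingress at depth.",
        "Waterproof enclosure using an o-ring seal around the lens port.",
    ],
    "Battery System": [
        "A rechargeable lithium battery pack with a charging circuit and thermal cutoff.",
        "Battery cell management circuit that balances charge across cells.",
    ],
    "Image Capture": [
        "An image sensor captures light through a lens and outputs pixel data.",
        "Exposure control for a camera sensor adjusts shutter timing per frame.",
    ],
}

model = train(training_sets)
tags = classify("Gasket seal that keeps water out of a camera housing.", model)
print(tags)
```

On this toy data the first tag returned is "Water Proofing", and "Battery System" is never assigned because the query shares no terms with its exemplars. With real training sets of 5-10 full-text documents per category, the threshold and the maximum number of tags would need tuning.
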
Since I'm a big fan of functional programming, I developed the entire system in Clojure and Mathematica, linked together with J/Link. The result is a system that is entirely controllable and deployable in only 12 lines of code (or 7 meaningful lines if you aren't counting the syntax parentheses):

And the output is very straightforward as well, delivered in an Excel spreadsheet:

Once the Autoclassifier training set is built, this process takes about 5 seconds to run for every 1,000 new patents filed. I have worked with IP organizations to develop these patent classification training sets for each product they make, each competitor company they look at, and each project team working at the company.

Perhaps the most interesting application I've seen to date is to train the Autoclassifier on training sets that reflect how one company understands its own technology, and then use that classification system to understand how its 6 closest competitors are investing and filing IP differently. Done manually, that process costs tens to hundreds of thousands of dollars in legal fees today, while these scripts accomplish the same task within minutes!
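
Once every patent carries internal tags, the competitor comparison reduces to counting. The sketch below is one hypothetical way to do that in Python, assuming the classifier output is available as `(assignee, tags)` pairs; the function name `category_shares`, the company names, and the sample data are invented for illustration.

```python
from collections import Counter, defaultdict


def category_shares(tagged_patents):
    """Map each assignee to {tag: fraction of that assignee's patents carrying the tag}."""
    patent_counts = Counter()
    tag_counts = defaultdict(Counter)
    for assignee, tags in tagged_patents:
        patent_counts[assignee] += 1
        tag_counts[assignee].update(set(tags))  # count each tag once per patent
    return {
        assignee: {tag: n / patent_counts[assignee] for tag, n in counts.items()}
        for assignee, counts in tag_counts.items()
    }


# Hypothetical classifier output: one (assignee, tags) pair per patent.
tagged = [
    ("Our Company", ["Enclosure", "Water Proofing"]),
    ("Our Company", ["Enclosure"]),
    ("Competitor X", ["Battery System"]),
    ("Competitor X", ["Battery System", "Image Capture"]),
]

shares = category_shares(tagged)
for assignee, by_tag in shares.items():
    print(assignee, by_tag)
```

Comparing the resulting shares side by side shows where a competitor concentrates its filings relative to your own portfolio, using your own category names rather than the patent office's.
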

Another very interesting use I've seen of these scripts was by a tech company that wanted to find startups to acquire in several different technology areas. Instead of hiring attorneys and consultants to screen hundreds of startups, the engineering team identified a dozen examples of technologies they wanted to acquire, I found ~10 patents that exemplified each technology (this was the slowest step, at about 3 hours of my time), and the tool was then used to rank and prioritize the top startups filing patents in those areas. In total, 800,000 patents were read and classified, which took about an hour of computing time, in line with the roughly 5 seconds per 1,000 patents quoted above.
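
The ranking step in that screening exercise can be sketched as follows. This is an assumed scoring rule, not the one actually used: companies are ordered by how many of their patents received at least one target tag, with ties broken by how many distinct target technologies they cover. The function name `rank_companies`, the startup names, and the data are hypothetical.

```python
from collections import defaultdict


def rank_companies(tagged_patents, target_tags, top_n=10):
    """Rank assignees by number of patents matching any target tag.

    Returns a list of (assignee, matching_patent_count, covered_target_tags).
    """
    targets = set(target_tags)
    hits = defaultdict(int)
    covered = defaultdict(set)
    for assignee, tags in tagged_patents:
        matched = targets.intersection(tags)
        if matched:
            hits[assignee] += 1
            covered[assignee] |= matched
    # Most matching patents first, then broadest coverage, then name for stability.
    ordered = sorted(hits, key=lambda a: (-hits[a], -len(covered[a]), a))
    return [(a, hits[a], sorted(covered[a])) for a in ordered[:top_n]]


# Hypothetical classifier output: one (assignee, tags) pair per patent.
tagged = [
    ("Startup A", ["Image Capture"]),
    ("Startup A", ["Image Capture", "Image Storage"]),
    ("Startup B", ["Battery System"]),
    ("Startup C", ["Enclosure"]),
    ("Startup B", ["Image Storage"]),
]

ranking = rank_companies(tagged, ["Image Capture", "Image Storage"])
for row in ranking:
    print(row)
```

Here Startup A ranks first with two matching patents, Startup B follows with one, and Startup C drops out because none of its patents fall in the target technologies.
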