Semantic Language Understanding With a Side of Machine Learning

Engineering, Jun 01, 2016

As I look forward to Strata + Hadoop World London, I find myself thinking of the rich history of London, and of London’s giants in medicine and computer science.  For instance, the Florence Nightingale Museum is just a short Tube ride from the birthplace of Alan Turing. That’s just two of a host of medical and computer science related sites in and near London, to say nothing of the area’s burgeoning startup culture.

At Atigeo, back in the Seattle area, we focus on healthcare analytics as part of being a “compassionate technology company,” and this led us to become very interested in medical encounter notes.  These are the notes a physician takes during or after an exam, documenting the patient’s medical history, their symptoms, and other relevant information. These notes must be assigned an alphanumeric code, required for insurance reimbursements as well as for the assignment of medical resources, and to guide research and follow-up. Usually, a medical coder assigns codes in systems like ICD-10, which encodes illnesses and injuries from the everyday to the bizarre, and includes details like the laterality of an injury, and whether the visit is the first for a particular illness or injury.

Here is how all of this relates to the topic Claudiu and I will be speaking on:  “Semantic Natural Language Understanding with Spark, UIMA, and Machine-Learned Ontologies.” The complexity of ICD-10 and the growing number of patient visits put a strain on the medical coding process, and that strain grows worse each year.  So Atigeo created xPatterns CLP, our advanced natural language processing system, to help address this problem.

NLP and Search

Many of us at Atigeo came from a background in online search, which requires the construction and constant maintenance of a pipeline to crawl and index the huge number of documents on the web. Those documents are tokenized — broken into their linguistic components — and analyzed by file format and HTML markup structure for important cues as to each document’s meaning. Search engine users get assistance through by-products of the search indexing process, such as spelling correction and auto-suggestion of terms that might complete the searcher’s query (which may itself lead searchers to new insights, or unintended mirth). And, finally, all these documents are relevance-ranked according to, among other rank components, which other documents link to the document in question, and the reputation of those linking documents.
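To make that pipeline a little more concrete, here is a minimal Scala sketch of the tokenize-and-index step: a toy inverted index. The documents and the simple regex tokenizer are purely illustrative, not how a production search engine actually does it.

    // A toy inverted index: tokenize each document, then map every term to the
    // set of documents that contain it. Document IDs and texts are illustrative.
    object InvertedIndexSketch {

      // Break a document into lower-cased word tokens. Real pipelines also handle
      // HTML markup, stemming, and language-specific rules.
      def tokenize(text: String): Seq[String] =
        text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty)

      def main(args: Array[String]): Unit = {
        val docs = Map(
          "doc1" -> "Tube stations near the Shard",
          "doc2" -> "The Scala language and Spark"
        )

        // term -> set of document IDs containing that term
        val index: Map[String, Set[String]] =
          docs.toSeq
            .flatMap { case (id, text) => tokenize(text).map(term => (term, id)) }
            .groupBy(_._1)
            .mapValues(pairs => pairs.map(_._2).toSet)
            .toMap

        println(index.getOrElse("shard", Set.empty)) // Set(doc1)
      }
    }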

This level of document analysis is not sufficient, however, for assigning a correct ICD-10 code to a medical document. Searchers have learned to type short queries like “Scala language” or “Tube stations near the Shard,” but physicians do not, and should not, carefully phrase their encounter notes to suit such simplistic language processing. You’d expect to see “she denies seasonal allergies and post-nasal drip,” but not a search query-style “not ‘allergies’ and not ‘post-nasal drip’.” Even that search-style formulation loses the shades of meaning in saying the patient “denies” something, meaning that the doctor suspects she may in fact have the symptoms but for some reason does not wish to say so.
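As a toy illustration of why wording like “denies” matters, here is a crude cue-word rule in Scala that flags a symptom mention as negated or hedged when a cue such as “denies” appears before it. The cue list and the logic are invented for illustration; real clinical NLP needs far richer handling of scope and exceptions.

    // A crude negation/hedging check: a symptom mention counts as negated or
    // hedged if a cue word appears earlier in the sentence. Illustration only.
    object NegationSketch {
      val cues = Seq("denies", "no signs of", "not")

      def isNegated(sentence: String, symptom: String): Boolean = {
        val s = sentence.toLowerCase
        val symptomIdx = s.indexOf(symptom.toLowerCase)
        symptomIdx >= 0 && cues.exists { cue =>
          val cueIdx = s.indexOf(cue)
          cueIdx >= 0 && cueIdx < symptomIdx // cue precedes the symptom mention
        }
      }

      def main(args: Array[String]): Unit = {
        println(isNegated("She denies seasonal allergies and post-nasal drip", "post-nasal drip")) // true
        println(isNegated("She reports seasonal allergies", "seasonal allergies"))                 // false
      }
    }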

Semantic search techniques to the rescue?

Shouldn’t it help to focus on the meaning of those documents we’re busily indexing and ranking? A search engine, in order to successfully respond to typical and natural queries, must give good answers to subtler, contextually-influenced, or even ambiguous queries, such as “where’s the best curry takeaway near me?” Those queries can be analyzed and broken down into facets, so that a search for “best coat to pack for London” would produce, on a shopping-focused website, “facets,” which are dynamic qualifiers that narrow down the search. In this example, brand and price range would be good facets, if search experience shows that brand and price are top ways that shoppers select coats.
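Here is a toy Scala sketch of how faceting can work under the hood: given the products that match a query, count the distinct values of fields like brand and price range so the UI can offer them as ways to narrow the search. The product data and field names are made up.

    // A toy facet builder: count distinct brand and price-range values across the
    // products matching a query, so the UI can offer them as narrowing options.
    object FacetSketch {
      case class Product(name: String, brand: String, priceRange: String)

      def main(args: Array[String]): Unit = {
        val matches = Seq(
          Product("Wool overcoat",  "BrandA", "100-200"),
          Product("Rain jacket",    "BrandB", "50-100"),
          Product("Packable parka", "BrandB", "100-200")
        )

        // facet value -> number of matching products
        val brandFacet = matches.groupBy(_.brand).mapValues(_.size)
        val priceFacet = matches.groupBy(_.priceRange).mapValues(_.size)

        println(brandFacet) // e.g. Map(BrandA -> 1, BrandB -> 2)
        println(priceFacet) // e.g. Map(50-100 -> 1, 100-200 -> 2)
      }
    }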

Adding ML to the mix

Machine learning brings even more to the table, no pun intended, when applied to documents like a review of a favorite sushi spot of mine back in Seattle. A simple semantic or even dictionary-based approach might find expected words in a review of a sushi restaurant, such as “sushi bar” and “omakase.”

Explaining machine learning, or ML, is a bit beyond what I have space for in this blog. But for search engines, consider the vast numbers of searchers all querying for many of the same things. The search engine can apply machine learning techniques and use as a “signal” whether users click each result or ignore it. Search engines may also have access to signals from social media and other user inputs to indicate which documents are highly relevant to a particular search query.
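As a hedged sketch of that click-signal idea, the Spark MLlib example below trains a simple classifier whose label is whether a user clicked a result, over a couple of invented relevance features. Real ranking models use vastly more signals and data; this only shows the shape of the approach.

    // Train a tiny click model with Spark MLlib: label 1.0 = clicked, 0.0 = ignored;
    // the features stand in for relevance signals (term-match score, share count).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object ClickSignalSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("click-signal").setMaster("local[*]"))

        // Invented training examples: (clicked?, [term-match score, share count])
        val examples = sc.parallelize(Seq(
          LabeledPoint(1.0, Vectors.dense(0.9, 12.0)),
          LabeledPoint(1.0, Vectors.dense(0.7, 3.0)),
          LabeledPoint(0.0, Vectors.dense(0.2, 0.0)),
          LabeledPoint(0.0, Vectors.dense(0.1, 1.0))
        ))

        val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(examples)

        // Score a new (query, document) pair: 1.0 means "likely to be clicked"
        println(model.predict(Vectors.dense(0.8, 5.0)))

        sc.stop()
      }
    }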

So, machine learning could find more surprising values for a search for “sushi in Seattle,” like “documentary film,” and would be able to build a surprising facet like “movie-related” for restaurants. By surprising searchers, or even delighting them, a search engine can differentiate itself, allowing Seattle visitors who are film fans to have the “Jiro Dreams of Sushi” lunch of dreams.

Bringing this all back to medical notes, however, we are still short of what’s needed to help the overburdened medical coders and physicians. Returning to our example, the note may also contain these statements, which we will assume come in order (a toy sketch of labeling them follows the lists below).

  • “Jane complains about flu-like symptoms,” contains a speculation that the symptoms resemble the flu, and it remains unclear from this sentence who made that speculation.
  • “Jane is at high risk for flu if she’s not vaccinated,” so Jane’s risk could possibly be reduced.
  • “Jane’s older brother had the flu last month,” so we know a new fact about Jane’s brother.

Note, however, that we still do not know the doctor’s diagnosis of Jane herself.

  • “Jane expressed concerns about the risks of bird flu in her neighborhood,” which does not say anything about Jane’s diagnosis.
  • “Jane shows no signs of bird flu, except for shortness of breath,” meaning that Jane does in fact exhibit one symptom of bird flu.
  • “Overtired but not see dehydration,” may be a mistype (“see” vs. “seeing”), and lacks a subject or object. Is Jane the one who is tired and dehydrated, or, naively, could it be the doctor who is having problems here?
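To make the challenge concrete, here is a toy Scala labeler that assigns an assertion status, such as FamilyHistory, Negated, or Speculative, to each of those statements based on cue phrases. The cue lists are invented, and notice that the blunt “no signs of” rule misses the “except for shortness of breath” exception, exactly the kind of subtlety a real system has to capture.

    // A toy assertion labeler for the statements above. Cue phrases are invented;
    // note that the blunt "no signs of" rule misses the "except for ..." exception.
    object AssertionSketch {
      def label(sentence: String): String = {
        val s = sentence.toLowerCase
        if (s.contains("brother") || s.contains("mother") || s.contains("father")) "FamilyHistory"
        else if (s.contains("no signs of")) "Negated"
        else if (s.contains("expressed concerns") || s.contains("risk")) "ConcernOrRisk"
        else if (s.contains("complains about") || s.contains("flu-like")) "Speculative"
        else "Mentioned"
      }

      def main(args: Array[String]): Unit = {
        Seq(
          "Jane complains about flu-like symptoms",
          "Jane is at high risk for flu if she's not vaccinated",
          "Jane's older brother had the flu last month",
          "Jane expressed concerns about the risks of bird flu in her neighborhood",
          "Jane shows no signs of bird flu, except for shortness of breath"
        ).foreach(sentence => println(s"$sentence -> ${label(sentence)}"))
      }
    }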

Our talk at Strata + Hadoop World London details how Claudiu and I applied ML, along with Apache Spark for very fast big data processing and UIMA (Unstructured Information Management Architecture), to produce high-accuracy results in natural language understanding.
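For a rough sense of the overall shape (and emphatically not our actual pipeline code), the sketch below distributes encounter notes with Spark and applies an annotation step per partition; the annotate function here is a hypothetical stand-in for a UIMA analysis-engine invocation.

    // Distribute encounter notes with Spark and annotate them per partition.
    // `annotate` is a hypothetical stand-in for a UIMA analysis-engine call.
    import org.apache.spark.{SparkConf, SparkContext}

    object NoteAnnotationSketch {
      // Placeholder for a per-document annotation pipeline (sentence splitting,
      // negation/assertion detection, ontology lookup, ...). Hypothetical only.
      def annotate(note: String): Seq[String] =
        Seq(s"annotations for: $note")

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("note-annotation").setMaster("local[*]"))

        val notes = sc.parallelize(Seq(
          "She denies seasonal allergies and post-nasal drip.",
          "Jane complains about flu-like symptoms."
        ))

        // mapPartitions lets each worker set up its (expensive) annotation engine
        // once per partition rather than once per document.
        val annotated = notes.mapPartitions { docs =>
          // val engine = ... build the UIMA analysis engine here (assumption) ...
          docs.map(annotate)
        }

        annotated.collect().foreach(println)
        sc.stop()
      }
    }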

While this blog post discusses a medical coding scenario, at Strata London we will cover a wide variety of use cases for xPatterns CLP, our natural language system.

We will share our slides on SlideShare, and I welcome your questions via Twitter. Now, off to those medical museums, with a stop for sushi at the Shard on the way. Cheers!


Written by David Talby

Chief Technology Officer