Atigeo's Top Takeaways From the 2016 Data Science Summit

Events, Jul 14, 2016

Industry-leading data scientists and developers, i.e., Atigeo’s tribe, gathered en masse at the Fairmont Hotel in San Francisco this week for Data Science Summit 2016. David Talby, our Chief Technology Officer, was on board this year to present on the application of machine learning to natural language.

David answered our questions about his presentation, as well as what he learned from talks he attended and conversations with big players at the most impressive companies in the field.

His answers include the best of the last two days in San Francisco, so read on for some of the inside scoop on the latest news and greatest challenges facing the data science industry.

1. Why is Data Science Summit a not-to-be-missed event?

Atigeo has been attending the DSS for three years now, and we go back every year for the fantastic lineup of speakers, the rich technical content, and the hands-on in-industry audience that we get to both educate and learn from.

2. You presented on natural language processing, or semantic natural language understanding. Can you explain this concept in simple terms?

Natural language processing is the extraction of essential meaning from free text (for example, doctor’s notes from the field). In a bigger-picture sense, it’s a method for organizing, interpreting, and gaining utility from the variety and nuances of language.

3. In your presentation, you cited several questions and challenges in the healthcare realm (for example: Who fits this clinical trial? Who needs to be vaccinated? Who is prescribed meds they’re allergic to? Etc.). Is there a common denominator to these questions and, if so, what is it?  

All the questions raised at the beginning of the talk are fundamentally queries about patients. In order to best address patient problems, we need to clearly define patients’ current or previously received treatments and their specific needs. These details become essential criteria for use in clinical trials, such as whether or not patients are at a specific stage of cancer, if they’ve received chemo treatments, if they’re between the ages of eighteen and sixty-four, etc. If we’re thinking about who needs vaccinations, our questions will be about what vaccines said patients have or haven’t received, whether it was during chemo or before a transplant, etc. In the case of unusual side effects of a drug, we’ll have yet another set of questions.

The most essential challenges in healthcare are about patients, which is why Atigeo started there. That said, there are many interesting questions about doctors as well (for example, which doctors are most effective for which conditions, types of patients, or cases).

4. So in making these definitions of patients and their needs, it’s essential to collect the right data. What separates “trivial” from non-trivial data? 

Defining what is “trivial” is subjective when you get into more complicated situations, but there are easy cases. Making inferences from clean, structured data is relatively easy. For example, if a system deduces that “the patient is a child” from the doctor’s note that the patient’s age is three, that would be considered trivial. In contrast, if the free-text clinical notes mention that a patient arrived alone for his first three chemo treatments and that he has a family history of alcoholism, a non-trivial deduction would be that this patient is at high risk for depression.
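To make the “trivial” case concrete, here is a minimal sketch in Python. The PatientRecord class, its age_years field, and the cutoff of 18 are assumptions made for illustration; they do not come from Atigeo’s systems.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical structured record; the field name is illustrative."""
    age_years: int


def is_child(record: PatientRecord, adult_age: int = 18) -> bool:
    # Assumed cutoff: anyone younger than adult_age counts as a child.
    return record.age_years < adult_age


print(is_child(PatientRecord(age_years=3)))
```

The non-trivial case has no one-line rule like this, because the evidence is scattered across free text and has to be extracted before any deduction can be made.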

5. There are specific tools you use to make this natural language learning possible. What are those tools and how do they work?

Some of the essential tools I mentioned in my talk are curated ontologies, machine learned ontologies, and Unstructured Information Management Architecture (UIMA) annotators. I’ll define one at a time.

A curated ontology is a list of terms and relationships between them, which is generated manually by humans. For example, a list of all known body parts (terms), including synonyms and hierarchy between body parts (relationships), that has been built and edited by human experts, is a curated body parts ontology.
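As a rough illustration, a tiny fragment of such an ontology could be stored as a lookup table. The terms, synonyms, and parent links below are a hand-written toy example, not data from a real medical ontology, and the function names are hypothetical.

```python
# Toy, hand-curated fragment: each term lists its synonyms and its parent.
ONTOLOGY = {
    "body": {"synonyms": [], "parent": None},
    "upper limb": {"synonyms": ["upper extremity"], "parent": "body"},
    "hand": {"synonyms": [], "parent": "upper limb"},
    "thumb": {"synonyms": ["pollex"], "parent": "hand"},
}


def canonical_term(word):
    """Map a term or one of its synonyms to the canonical term, else None."""
    w = word.strip().lower()
    for term, entry in ONTOLOGY.items():
        if w == term or w in entry["synonyms"]:
            return term
    return None


def ancestors(term):
    """Walk the hierarchy upward from a canonical term to the root."""
    chain = []
    parent = ONTOLOGY[term]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return chain


print(canonical_term("Pollex"))
print(ancestors("thumb"))
```

With this structure, a mention of “pollex” in a note resolves to “thumb”, and the hierarchy tells the system that a thumb finding is also a hand and upper limb finding.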

A machine learned ontology is a list of terms and the relationships between them, which is generated by algorithms (as opposed to being manually edited by human experts).

A UIMA annotator is a Java class defined within the UIMA Java framework that includes the code and data necessary to infer one specific type of annotation from a given text. Essentially, it is a sentence or phrase dissector that finds parts of speech and uses them to deduce meaning. For example, if the system finds a number followed by “lbs.” or “pounds”, it can recognize that the patient’s weight is being given. This recognition of units of measurement can also indicate more complex data like blood pressure, vital signs, etc.
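The weight example can be sketched with a regular expression. This is plain Python rather than the UIMA Java API, and the Annotation type, the WEIGHT_PATTERN rule, and the annotate_weights function are hypothetical names chosen for illustration; a real annotator would be written against the framework’s own classes.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation: a label plus the text span it covers."""
    label: str
    begin: int
    end: int
    value: float
    unit: str


# A number (optionally with decimals) followed by "lbs", "lbs." or "pounds".
WEIGHT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(lbs\.?|pounds)", re.IGNORECASE)


def annotate_weights(text):
    """Return one Weight annotation per match found in the text."""
    return [
        Annotation("Weight", m.start(), m.end(), float(m.group(1)), "lb")
        for m in WEIGHT_PATTERN.finditer(text)
    ]


note = "Patient weighs 182 lbs. and reports mild fatigue."
for ann in annotate_weights(note):
    print(ann.label, ann.value, ann.unit, repr(note[ann.begin:ann.end]))
```

Keeping the character offsets alongside the value mirrors the idea of an annotation as a label attached to a span of the original text, so later steps can still see where the fact came from.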

6. The process of natural language learning is a powerful and complex one. What is your opinion about trust and prediction making in such a complex field? (These were two topics of Carlos Guestrin’s talk: “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier”.)

I think that “explain-ability” and trust in machine learning models are of critical importance, and they are a problem that both industry and academia face. At Atigeo, we’ve had cases where the most accurate models produced were not the best for our customers because they were too difficult to explain to those outside the data science field. This in turn makes it difficult to gain the trust of the clinicians who would directly benefit from the system created.

In his talk at DSS this week, Carlos Guestrin did a great job of showing the link between “explain-ability” and trust. He gave examples of models that score well on easy-to-measure machine learning metrics, such as precision, yet perform poorly in practice because they rely on features that anyone (not to mention domain experts) can easily recognize as faulty, since those features over-simplify the scenario or problem.

Developing and utilizing user-based ‘trust’ metrics for how machine learning models are created is, therefore, a critical addition to model-building best practices.

7. So, who else did you connect with at DSS? What topics did you discuss?

I had fantastic, face-to-face, technical discussions with so many leaders of the field over the last two days. I spent time with CTOs, Directors, and VPs of data science teams from Elsevier, Google, O’Reilly, QPID, Spare 5, CrowdFlower, Spoken Labs, Tableau, Concur, Intel, Turi (formerly Dato), Domino Labs, Personalics, the West Big Data Innovation Hub, iMerit, and others. Some of these conversations were about partnerships, some were with companies Atigeo can help, and many were about where the field is going. We got into the nitty gritty of advances and challenges in deep learning, reinforcement learning, data annotation and enrichment, natural language understanding, visualization, and even data science in academia.

8. From those conversations and the experience as a whole, what are your top takeaways from the DSS?

The buzzword-du-jour is deep learning. There is huge interest in Google’s TensorFlow, an open-source machine learning library. Spark still reigns supreme: it still generates a huge amount of interest and continues to be useful to a large community of scientists. And there is burgeoning interest in the kind of work we’re doing here at Atigeo, managing unstructured noisy data like speech, video, images, and natural language text.

The world of data science is growing and becoming more and more complex, and DSS certainly confirmed that this week. I’m excited to see that Atigeo is atop this wave and ahead of the curve when it comes to collecting data and using it to create effective, wise systems for the healthcare field.

Big thanks to the DSS for a great event and the opportunity to come together around our shared future!

Written by xPatterns