Philip Ogren

Last updated: June 12, 2008.
My primary research interests are human language technologies, the use of machine learning and knowledge engineering to create them, and the application of these technologies to the biomedical domain. I live in Boulder, CO, where I am a full-time student working on a Ph.D. in Computer Science.

Contact Information

Software

Publications

Work History/Timeline

Dates | Role | Organization
8/2002 - Present | Ph.D. Student | Department of Computer Science at the University of Colorado at Boulder; Computation Bioscience Program at the University of Colorado Health Sciences Center; The Center for Computational Language and Education Research at the University of Colorado at Boulder
11/2005 - Present | Consultant | Health Language, Inc.; Division of Biomedical Informatics at the Mayo Clinic College of Medicine; LawCommons.org; IT.com; Computation Bioscience Program at the University of Colorado Health Sciences Center
11/2005 - 11/2006 | Senior Analyst/Programmer | Division of Biomedical Informatics at the Mayo Clinic College of Medicine
1/2005 - 10/2005 | Senior Professional Research Assistant | Computation Bioscience Program (CPB) at the University of Colorado Health Sciences Center
8/2002 - 12/2004 | Student Research Assistant | Department of Computer Science at the University of Colorado at Boulder
6/2002 - 8/2002 | Summer Intern | Computation Bioscience Program (CPB) at the University of Colorado Health Sciences Center
3/2002 - 5/2002 | Intern | BEA Systems, Inc.
1/1998 - 3/2002 | Analyst/Programmer | Division of Biomedical Informatics at the Mayo Clinic College of Medicine
1/1997 - 8/1997 | Intern | Division of Biostatistics at the Mayo Clinic College of Medicine

Work Experience

CLEAR
The Center for Computational Language and Education Research is a research group within the Department of Computer Science that focuses on human language technologies (i.e. natural language processing or computational linguistics). The lab has had great success both academically and commercially in areas such as speech recognition, named entity recognition, and semantic role labeling. I am one of the primary developers of ClearTK, a toolkit for developing natural language processing (NLP) components on top of UIMA.

IT.com
IT.com is a small start-up company that is building a search application for email collections obtained for legal discovery. My responsibility is to create software that parses and analyzes email messages to recover their structure. I applied Conditional Random Fields (CRF) to identify various sections of interest that occur in many email messages, such as embedded message headers, signatures, and legal disclaimers. The implementation was inspired by and derived from the work of Carvalho and Cohen. The features consist of regular expressions and matches against various lexicons (e.g. U.S. cities and common first names). The training data was created using Knowtator (see below). I also implemented a modified version of the approach of Yeh and Harnly to reconstruct email threads. Other tasks have included name normalization, identification of duplicate messages, date normalization, and other miscellaneous cleanup of the data. Additionally, I have built the software infrastructure around these algorithms that takes the messages from their raw form and produces relational data. This was accomplished using UIMA, Lucene, and MySQL.
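As a rough illustration of the kind of features described above, the following Python sketch turns each line of an email into a set of named features from regular expressions and lexicon lookups. It is a minimal sketch and not the actual implementation: the pattern names, the patterns themselves, and the tiny lexicons are all hypothetical, and the CRF training step that would consume these features is not shown.

```python
import re

# Hypothetical lexicons; the real lists (U.S. cities, common first names)
# are not reproduced here.
CITIES = {"boulder", "denver", "rochester"}
FIRST_NAMES = {"philip", "mary", "john"}

# Hypothetical regular-expression features for one line of an email body.
PATTERNS = {
    "header_field": re.compile(r"^(from|to|cc|sent|date|subject):", re.IGNORECASE),
    "quoted": re.compile(r"^\s*>"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
    "original_message": re.compile(r"-+\s*original message\s*-+", re.IGNORECASE),
}


def line_features(line):
    """Return the set of feature names that fire for a single line."""
    feats = {name for name, pattern in PATTERNS.items() if pattern.search(line)}
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", line)}
    if tokens & CITIES:
        feats.add("lex_city")
    if tokens & FIRST_NAMES:
        feats.add("lex_first_name")
    return feats


def message_features(lines):
    """Return one feature set per line, in order, as input for a sequence labeler."""
    return [line_features(line) for line in lines]
```

A sequence labeler such as a CRF would then be trained on these per-line feature sets paired with section labels (header, signature, disclaimer, and so on).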

Mayo Clinic College of Medicine (2006)
I worked on the text analysis team within the Division of Biomedical Informatics, an NIH-funded research lab. My responsibilities included project management, software development, and research. The following is a list of my major responsibilities:
  • Legacy code ownership: The text analysis team has created a suite of text analysis tools based on the open source UIMA created by IBM. UIMA provides middleware for defining, implementing, integrating, and deploying text analysis components in a highly flexible and scalable way. The system that was built performs a variety of text analysis tasks (e.g. part-of-speech tagging and named entity recognition) over a large collection (~20M) of clinical notes. The output of this system is fed into a retrieval tool used by clinicians and researchers. I helped update the code to work with the latest version of UIMA and updated and extended several modules.
  • System evaluation: The aforementioned text analysis system has never been formally evaluated with respect to the accuracy of the codes that it generates. I managed the effort to produce a gold standard test set suitable for evaluation. This involved training four retrieval experts to label mentions of SNOMED-CT codes corresponding to disorders found in clinical notes. Special attention was given to producing a high-quality data set by conducting a study that involved four-way inter-annotator agreement analysis in which two of the annotators were given clinical notes pre-annotated by the MetaMap system.
  • Mawui: I implemented an open source library called Mayo Weka/Uima Integration (Mawui). It provides a mechanism for UIMA components to access the machine learning libraries found in the very popular Weka machine learning environment. Please see the website for details.
  • IBM/Mayo collaboration projects: I was involved with two text analysis projects that are collaborative efforts with IBM Research. One was a Word Sense Disambiguation (WSD) research project and the other tackled the task of parsing Structured Cancer Representations from pathology reports. I managed four annotators who created the gold-standard data sets for both of these projects and helped with miscellaneous analyses and software support tasks.

University of Colorado Health Sciences Center
  • Designed and developed Knowtator, a text annotation tool that integrates with the Protege knowledge base system for creating training and evaluation corpora suitable for information extraction systems.
  • Helped design and test an information extraction system called OpenDMAP whose release to the open source community is pending.
  • I worked on a large variety of small(er) projects as a graduate student, including debugging and extending a Lisp-based knowledge-base system, helping create a database and search index for a local copy of a large (~40 GB) online literature resource called Medline, performing entity identification in biomedical literature, analyzing a popular ontology (GO) used for annotating protein databases, and using knowledge base resources to augment NLP activities. I am the first author on two peer-reviewed research papers that I presented at the Pacific Symposium on Biocomputing in 2004 and 2005.

Mayo Clinic College of Medicine (1998-2002)
Research and development in the lab centered on the application of controlled health vocabularies and computational linguistics techniques to the problem of clinical data encoding and retrieval.
  • I worked as primary developer for a prototype application called the Mayo Vocabulary Processor. This software maps free-text noun phrases from medical data to coded concepts of a controlled medical vocabulary. The coded concepts are in turn used to create compositional problem descriptions for the purpose of concept based indexing. Responsibilities included algorithm development, application design, data modeling, pilot development & deployment, demonstrations/presentations (formal and ad-hoc), research specific tasks (ad-hoc experiments, database retrievals, data summaries).
  • A subsequent project involved indexing a 10M document corpus of clinical notes. We applied context-specific synonymy, acronym and abbreviation expansion, word normalization, spell correction, and controlled vocabularies/ontologies to build intelligent queries. We marked up the corpus text with a data-driven terminology consisting of noun phrases with high selection value determined by frequency and mutual information. The terminology was in turn used to help build queries to increase relevance of results.
  • Additionally, I helped develop and maintain the lab's software infrastructure. This included development of a library of utility classes, investigating and incorporating various software packages, generating/writing documentation and applying version control. I also helped conduct usability studies, reviewed manuscripts, attended conferences and visited other research labs. I was a secondary author on several peer-reviewed conference papers.
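
The idea of selecting terminology by frequency and mutual information, mentioned for the corpus indexing project above, can be sketched as follows. This is only an illustration under stated assumptions: it scores adjacent word pairs rather than full noun phrases, uses pointwise mutual information as the measure, and the function name and `min_count` threshold are hypothetical. The scoring actually used in the project is not described here.

```python
import math
from collections import Counter


def score_bigrams(sentences, min_count=2):
    """Return {(w1, w2): pmi} for adjacent word pairs seen at least min_count times.

    sentences is a list of token lists. The frequency threshold filters out
    rare pairs, and pointwise mutual information (base 2) ranks the rest.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to be a reliable terminology candidate
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

Pairs with both a high count and a high score would be the candidates kept as terminology; a real system would first restrict candidates to noun phrases.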

Education

Master of Science, Computer Science, University of Colorado at Boulder
  • Courses taken:
    • Artificial Intelligence, James Martin, Fall 2002
    • Linear Programming, Hal Gabow, Fall 2002
    • Natural Language Processing, James Martin, Spring 2003
    • Databases, Buzz King, Spring 2003
    • Data Mining & Statistical Inference, Rick Osborne, Spring 2003
    • Advanced Natural Language Processing, Dan Jurafsky, Fall 2003
    • Programming Languages, Bill Waite, Fall 2003
    • Algorithms, Hal Gabow, Spring 2004
    • Advanced Syntax, Laura Michaelis, Fall 2004
    • Machine Learning, Greg Grudic, Fall 2004
  • GPA = 3.9
  • Passed 3 Ph.D. qualifying exams in the following subject areas: Artificial Intelligence, Databases and Programming Languages.
  • Passed the comprehensive exam for the Ph.D.
Bachelor of Science, Mathematics, Harding University, 1997
Magna Cum Laude, GPA = 3.82
High School - National Merit Scholar

Skills

Links

References

I will gladly share names and contact information of past and present supervisors and co-workers upon request.