CS779: Statistical Natural Language Processing

Course Description

Natural language (NL) refers to the languages spoken and written by humans, and it is our primary mode of communication. With the growth of the World Wide Web, textual data has grown exponentially, which calls for algorithms and techniques that process natural language automatically and enable intelligent machines. This course focuses on understanding and developing linguistic techniques, statistical learning algorithms, and models for processing language. We take a statistical approach to natural language processing: we will learn how natural language understanding models can be built from statistical regularities in large corpora of text while leveraging linguistic theory.

CS779 is a research-project-based course: participants are required to work on open, unsolved research problems in NLP, and consequently considerable effort is expected from them. As in previous offerings of the course, project work may lead to a publication.

Course Content

  1. Introduction to Natural Language (NL): why it is hard to process NL, linguistics fundamentals, etc.
  2. Language Models: n-grams, smoothing, class-based models, Brown clustering (a minimal sketch appears after this list)
  3. Sequence Labeling: HMMs, MaxEnt, CRFs, and related applications of these models, e.g., part-of-speech tagging
  4. Parsing: CFGs, lexicalized CFGs, PCFGs, dependency parsing
  5. Applications: named entity recognition, coreference resolution, text classification, toolkits such as spaCy
  6. Distributional Semantics: the distributional hypothesis, vector space models, etc.
  7. Distributed Representations: neural networks (NNs), backpropagation, softmax, hierarchical softmax
  8. Word Vectors: feedforward NNs, Word2Vec, GloVe, contextualization (ELMo, etc.), subword information (fastText, etc.)
  9. Deep Models: RNNs, LSTMs, attention, CNNs, applications in language
  10. Sequence-to-Sequence Models: machine translation and other applications
  11. Transformers: BERT, transfer learning, and applications
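
To give a flavor of the second topic above, here is a minimal Python sketch of a bigram language model with add-one (Laplace) smoothing. The toy corpus, the <s>/</s> padding tokens, and the function name bigram_prob are illustrative assumptions, not material taken from the course.

    from collections import Counter

    # Toy corpus; in practice, counts come from a large text collection.
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["a", "cat", "ran"],
    ]

    # Collect unigram and bigram counts, padding each sentence with
    # start (<s>) and end (</s>) markers.
    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)

    def bigram_prob(prev, word):
        """P(word | prev) with add-one (Laplace) smoothing."""
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    # Smoothing assigns unseen bigrams a small nonzero probability.
    print(bigram_prob("the", "cat"))  # seen bigram: (1 + 1) / (2 + 8) = 0.2
    print(bigram_prob("the", "ran"))  # unseen bigram, still greater than zero
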

References: There is no single prescribed textbook; the course draws on a variety of sources, including books, research papers, and other courses. Relevant references will be suggested in the lectures. Some frequently used references are:

  1. Speech and Language Processing, Daniel Jurafsky and James H. Martin
  2. Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze
  3. Natural Language Understanding, James Allen
  4. Introduction to Natural Language Processing, Jacob Eisenstein

Course Audience

Prerequisites:
Must: Introduction to Machine Learning (CS771) or an equivalent course; proficiency in linear algebra, probability, and statistics; and proficiency in Python programming


Desirable: Probabilistic Machine Learning (CS772), Topics in Probabilistic Modeling and Inference (CS775), Deep Learning for Computer Vision (CS776)