Natural Language Processing
Natural language processing (NLP) is an interdisciplinary field at the intersection of computer science, artificial intelligence, and linguistics. NLP is concerned with the interaction between computers and human (natural) languages, and with the processing of textual information for various information-based applications.
NLP is concerned with linguistic analysis at various level: morphology, syntax, semantics, pragmatics.
Applications
Applications in NLP include text categorization, text summarization, machine translation, sentiment analysis, and question-answering, etc.
Natural Language Understanding (NLU) aims to help computers understand, manipulate and interpret human language. The NLU is at the heart of the increasingly popular dialogue systems.
Linguistic Analysis
- Sentences: breaking up a document into sentences
- Words: breaking up a document into words (where punctuation is a word)
- Lexical Units (compound nouns): breaking up a document into words, but keeping compound nouns together
- Morphology (finding roots): breaking up a document into the roots of a sentence
- Part-of-Speech (Grammatical Category): breaking up a sentence into grammatical features
- Syntactic Analysis Constituency: breaking up a sentence using Part-of-Speech and finding the types of phrases (i.e. Noun, Preposition, Verb)
- Syntactic Analysis Dependencies: breaking up a document using Part-of-Speech and finding the dependencies between them
- Coreference Analysis: Seeing what each subject in a document refers to
Ambiguity
Why is NLP so hard? Because language is AMBIGUOUS. Ambiguity is THE most difficult problem in NLP, and it comes in all flavors and forms.
- Various forms of words
- ex: cell-phone, cellphone, cell phone
- Sentence splitting
- ex: (standard with separators) The A.I. course CSI20.02-7 will be offered in winter 2021.
- ex: (texto) not sure about tonight will check my schedule when I get home
- POS tag
- ex: Will will finally have the will to write his will.
NLP Pipeline
Steps:
- Tokenization: Break it into words
- Lemmatization: Use morphology
- POS tagging
- Sentence segmentation: Break document into sentences
- Constituency or Dependency parsing: Use syntactic analysis constituency and dependencies
Named-entity recognition
A subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as names of persons, organizations, locations, expressions of times.
Entity types
- Enamex: Person, Org, locations
- Timex: Date, Time
- Numex: Money, Percentage, Quantity
NER Approaches
Regular expressions
- Entities with regular forms (predictable)
- Examples: dates, phone numbers, postal codes, emails
- adaptable to new "formatted" data
- not trivial to write the expressions, requires NE which have a regular surface form structure
- most languages have Regex support
Gazetteers
- Entities with a limited number of instances (enumeration)
- Examples: cities, countries, companies
- easy to have a first working system, many lists exist for various NE and Concepts
- Require additional matching algorithms for typographical errors
- Wikipedia and other resources.
Edit Distance
Calculate the distance between the surface forms.
- Deletion +1
- Insertion +1
- Replacement +1 (or +2 in Levenhstein distance)
Supervised Machine Learning in NLP
Formulate a NLP task as a SML problem
- Think of what is to be predicted
- Yes/No sentence split
- POS tag
- Named Entity
- Obtain ANNOTATED data
- Look for an existing dataset
- Build your own dataset
- Develop/use an annotation platform
- Think of input features
- What COULD be useful in prediction?
- How easy can we get this information?
Word Embeddings
Using a neural network, construct a word representation which encapsulates the word’s predictive nature of its context. A word becomes a vector representation in an N dimensional space, which is called an embedding.
Embedding are inputted into a model through learned weights.