What is a corpus in NLP?



In natural language processing (NLP), a corpus (plural: corpora) is a large, structured set of machine-readable texts that were produced in a natural communicative setting. It is a dataset of language resources, either born-digital or older material that has since been digitized, and it may be annotated or unannotated. A corpus may contain texts in a single language (monolingual corpus) or in multiple languages (multilingual corpus). Corpora are used in corpus linguistics for statistical hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. They are also used in search technology, where the corpus is the collection of documents being searched. To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Other notable areas of application include language technology, natural language processing, and computational linguistics.
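To make "checking occurrences" and "annotation" concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the three-sentence corpus is invented, and `LEXICON`, `tokenize`, and `annotate` are hypothetical names. The tag labels loosely follow the Universal POS tag names (DET, NOUN, VERB, ADP, ADV). Real part-of-speech taggers are statistical models trained on annotated corpora, not lookup tables like this one.

```python
from collections import Counter
import re

# Toy monolingual corpus: three short English "documents" (invented for illustration).
corpus = [
    "The cat sat on the mat.",
    "A dog chased the cat.",
    "The dog sat quietly.",
]

# Hypothetical hand-made lexicon; real taggers are trained on annotated corpora.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "ADP",
    "quietly": "ADV",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def annotate(text):
    """Return (token, tag) pairs; tokens missing from the lexicon get 'UNK'."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokenize(text)]

# Checking occurrences: count how often each word appears across the corpus.
word_counts = Counter(tok for doc in corpus for tok in tokenize(doc))

# Annotation: attach a part-of-speech tag to every token of every document.
annotated_corpus = [annotate(doc) for doc in corpus]

# Render the first annotated sentence in the common word/TAG notation.
tagged_line = " ".join(f"{tok}/{tag}" for tok, tag in annotated_corpus[0])

print(word_counts.most_common(3))
print(tagged_line)
```

The unannotated corpus is just the list of strings; the annotated version stores a tag next to every token, which is what lets a researcher ask questions such as "how often is this word used as a verb?" rather than only "how often does this word appear?".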
