Bigram

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
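As a minimal sketch of this definition, the adjacent pairs of any token sequence can be produced by pairing the sequence with itself shifted by one position. The function name `bigrams` is chosen here for illustration and does not come from any particular library.

```python
def bigrams(tokens):
    """Return every pair of adjacent elements in a sequence of tokens."""
    # zip stops at the shorter input, so a sequence of length n yields n - 1 pairs.
    return list(zip(tokens, tokens[1:]))


# The same function works for letters (a string) and for words (a list).
letter_pairs = bigrams("text")
word_pairs = bigrams("the quick brown fox".split())
```

Here `letter_pairs` is `[('t', 'e'), ('e', 'x'), ('x', 't')]`, and `word_pairs` holds the three adjacent word pairs of the phrase.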

The frequency distribution of the bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition.
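A frequency distribution of this kind could be computed along the following lines. This is a sketch under stated assumptions: it counts letter bigrams within words only (pairs that span a space are ignored), treats only the letters a to z as tokens, and uses the illustrative function name `letter_bigram_frequencies`.

```python
from collections import Counter
import re


def letter_bigram_frequencies(text):
    """Relative frequency of each letter bigram, counted within words only."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        # Slide a window of width 2 across the word.
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


freqs = letter_bigram_frequencies("the then")
```

For the input `"the then"` there are five bigrams in total (th, he, th, he, en), so `freqs["th"]` is 0.4 and `freqs["en"]` is 0.2.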

Gappy bigrams, or skipping bigrams, are word pairs that allow gaps between the two words, for example to skip over connecting words or to approximate the dependencies found in a dependency grammar.
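One possible way to enumerate such pairs is sketched below. The parameter `max_gap` (the largest number of skipped tokens allowed between the two words) and the function name `skip_bigrams` are assumptions made for this example; other definitions of gappy bigrams exist.

```python
def skip_bigrams(tokens, max_gap):
    """Return ordered pairs of tokens separated by at most `max_gap` other tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        # The partner may sit 1 .. max_gap + 1 positions to the right.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


words = "the cat sat".split()
adjacent_only = skip_bigrams(words, 0)   # ordinary bigrams
with_one_skip = skip_bigrams(words, 1)   # also pairs "the" with "sat"
```

With `max_gap=0` the result is the ordinary bigram list; with `max_gap=1` the pair `("the", "sat")` is added, skipping over "cat".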

Applications

Bigrams, along with other n-grams, are used in most successful language models for speech recognition.^[1]

Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.
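As a deliberately simple illustration of the idea, the sketch below breaks a Caesar shift cipher by trying all 26 shifts and keeping the candidate whose letter bigrams look most like English. The scoring table is the 14 most common bigrams from the frequency table later in this article. The function names, the choice of cipher, and the sample plaintext are assumptions for this example. General substitution cryptograms need a more elaborate search.

```python
import re

# Percent frequencies of the 14 most common English letter bigrams,
# copied from the first column of the frequency table in this article.
COMMON_BIGRAMS = {
    "th": 3.56, "he": 3.07, "in": 2.43, "er": 2.05, "an": 1.99,
    "re": 1.85, "on": 1.76, "at": 1.49, "en": 1.45, "nd": 1.35,
    "ti": 1.34, "es": 1.34, "or": 1.28, "te": 1.20,
}


def shift_letters(text, shift):
    """Caesar-shift lowercase letters by `shift`; leave other characters alone."""
    return "".join(
        chr((ord(c) - 97 + shift) % 26 + 97) if "a" <= c <= "z" else c
        for c in text
    )


def english_score(text):
    """Sum the table frequencies of every within-word letter bigram."""
    score = 0.0
    for word in re.findall(r"[a-z]+", text):
        for i in range(len(word) - 1):
            score += COMMON_BIGRAMS.get(word[i:i + 2], 0.0)
    return score


def crack_caesar(ciphertext):
    """Try all 26 shifts and keep the candidate that scores highest."""
    candidates = (shift_letters(ciphertext, -s) for s in range(26))
    return max(candidates, key=english_score)


# Hypothetical message, encrypted with a shift of 3 for the demonstration.
ciphertext = shift_letters("the weather in the north", 3)
recovered = crack_caesar(ciphertext)
```

In this example `recovered` equals the original plaintext, because the repeated "th" and "he" bigrams dominate the score. Very short texts, or texts with unusual letter patterns, may not be recovered by so small a table.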

Bigram frequency is one approach to statistical language identification.
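A toy version of this approach is sketched below: each language gets a profile of relative bigram frequencies built from a sample text, and an unknown text is assigned to the language whose profile gives its bigrams the highest total weight. The two one-line training samples, the labels, and all function names are invented for illustration. A usable identifier would need far larger samples and a more careful scoring rule.

```python
from collections import Counter
import re


def letter_bigrams(text):
    """List the within-word letter bigrams of a text, lowercased."""
    pairs = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def bigram_profile(text):
    """Relative frequency of each letter bigram in a training sample."""
    counts = Counter(letter_bigrams(text))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def identify(text, profiles):
    """Return the label whose profile gives the text's bigrams the most weight."""
    pairs = letter_bigrams(text)

    def score(lang):
        return sum(profiles[lang].get(p, 0.0) for p in pairs)

    return max(profiles, key=score)


# Hypothetical, deliberately tiny training samples.
profiles = {
    "en": bigram_profile("the cat and the hat"),
    "de": bigram_profile("der hund und die katze"),
}
guess_en = identify("that", profiles)
guess_de = identify("hund", profiles)
```

With these samples, `guess_en` is `"en"` and `guess_de` is `"de"`.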

Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram,^[2] or words containing a string of repeated bigrams, such as logogogue.^[3]
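Words of the second kind can be searched for mechanically. The sketch below uses a regular-expression backreference to find a bigram that is immediately repeated one or more times. The function name `repeated_bigram_run` is an assumption for this example, and only the letters a to z are considered.

```python
import re


def repeated_bigram_run(word):
    """Return the first run of an immediately repeated bigram, or None."""
    # ([a-z]{2}) captures a bigram; \1+ requires one or more direct repeats of it.
    match = re.search(r"([a-z]{2})\1+", word.lower())
    return match.group(0) if match else None


run = repeated_bigram_run("logogogue")
```

For "logogogue" the result is `"ogogog"`, the bigram "og" three times in a row. A word with no such repetition, such as "bigram", gives `None`.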

Bigram frequency in the English language

The frequencies of the most common letter bigrams in a large English corpus are:^[4]

th 3.56%       of 1.17%       io 0.83%
he 3.07%       ed 1.17%       le 0.83%
in 2.43%       is 1.13%       ve 0.83%
er 2.05%       it 1.12%       co 0.79%
an 1.99%       al 1.09%       me 0.79%
re 1.85%       ar 1.07%       de 0.76%
on 1.76%       st 1.05%       hi 0.76%
at 1.49%       to 1.05%       ri 0.73%
en 1.45%       nt 1.04%       ro 0.73%
nd 1.35%       ng 0.95%       ic 0.70%
ti 1.34%       se 0.93%       ne 0.69%
es 1.34%       ha 0.93%       ea 0.69%
or 1.28%       as 0.87%       ra 0.69%
te 1.20%       ou 0.87%       ce 0.65%

References

[1] Collins, Michael John (1996-06-24). "A new statistical parser based on bigram lexical dependencies". Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. pp. 184–191. arXiv:cmp-lg/9605012. doi:10.3115/981863.981888. S2CID 12615602. Retrieved 2018-10-09.
[2] Cohen, Philip M. (1975). "Initial Bigrams". Word Ways. 8 (2). Retrieved 2016-09-11.
[3] Corbin, Kyle (1989). "Double, Triple, and Quadruple Bigrams". Word Ways. 22 (3). Retrieved 2016-09-11.
[4] "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU". norvig.com. Retrieved 2019-10-28.

Data source: id.wikipedia.org via REST API v1
