Module:WikitextParser/sandbox
This is the module sandbox page for Module:WikitextParser. See also the companion subpage for test cases.
This module is rated as beta. It is considered ready for widespread use, but as it is still relatively new, it should be applied with some caution to ensure results are as expected.
This module is a general-purpose wikitext parser. It's designed to be used by other Lua modules and shouldn't be called directly by templates.
Usage
First, require WikitextParser and get some wikitext to parse. For example:
local parser = require( 'Module:WikitextParser' )
local title = mw.title.getCurrentTitle()
local wikitext = title:getContent()
Then, use and combine the available methods. For example:
local sections = parser.getSections( wikitext )
for sectionTitle, sectionContent in pairs( sections ) do
	local sectionFiles = parser.getFiles( sectionContent )
	-- Do stuff
end
Methods
getLead
getLead( wikitext )
Returns the lead section from the given wikitext. The lead section is defined as everything before the first section title. If there's no lead section, an empty string will be returned.
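The rule "everything before the first section title" can be tried outside Scribunto with a minimal standalone sketch. It mirrors the pattern used in the module source, but `trim` is a hypothetical stand-in for `mw.text.trim` and the sample wikitext is made up for illustration.

```lua
-- Hypothetical stand-in for mw.text.trim, so the sketch runs in plain Lua
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Drop everything from the first line that starts with "==" onwards
local function getLead( wikitext )
	wikitext = '\n' .. wikitext
	wikitext = wikitext:gsub( '\n==.*', '' )
	return trim( wikitext )
end

local lead = getLead( 'Intro text.\n\n== History ==\nOld stuff.\n' )
local noLead = getLead( '== Only ==\nBody' )

print( lead )    -- Intro text.
print( #noLead ) -- 0
```

Prepending a newline lets the same pattern catch a section title on the very first line, which is why the second call returns an empty string.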
getSections
getSections( wikitext )
Returns a table with the section titles as keys and the section contents as values. This method doesn't get the lead section (use getLead for that).
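Because the result is keyed by section title, iterating it with `pairs` gives no guaranteed order. The following standalone sketch, modeled on the patterns in the module source, shows one way to get a stable order by sorting the titles. `trim` and `escapeString` are local stand-ins, and the sample wikitext is an assumption made for illustration.

```lua
-- Hypothetical stand-in for mw.text.trim
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Escape Lua pattern magic characters in a literal string
local function escapeString( str )
	return ( str:gsub( '[%^%$%(%)%.%[%]%*%+%-%?%%]', '%%%0' ) )
end

-- Map each section title to the text up to the next title of any level
local function getSections( wikitext )
	local sections = {}
	wikitext = '\n' .. wikitext .. '\n=='
	for title in wikitext:gmatch( '\n==+ *([^=]-) *==+' ) do
		local section = wikitext:match( '\n==+ *' .. escapeString( title ) .. ' *==+(.-)\n==' )
		sections[ title ] = trim( section )
	end
	return sections
end

local sections = getSections( 'Lead.\n== History ==\nOld.\n=== Early ===\nOlder.\n== Uses ==\nMany.' )

-- Collect and sort the titles to iterate in a predictable order
local titles = {}
for title in pairs( sections ) do
	titles[ #titles + 1 ] = title
end
table.sort( titles )
for _, title in ipairs( titles ) do
	print( title, sections[ title ] )
end
```

In this sketch the "History" entry holds only "Old.", because the content of each section stops at the next title, including subsection titles.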
getSection
getSection( wikitext, sectionTitle )
Returns the content of the section with the given section title, including subsections. If you don't want subsections, use getSections instead. If the given section title appears more than once, only the first will be returned. If the section is not found, nil will be returned.
getSectionTag
getSectionTag( wikitext, tagName )
Returns the contents of the <section> tag with the given tag name (see Help:Labeled section transclusion). If the tag is not found, nil will be returned.
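As a rough illustration, the standalone sketch below extracts labeled sections with a pattern modeled on the one in the module source. `trim` and `escapeString` are local stand-ins, and the sample wikitext and the label "intro" are made up. When the same label is used several times, the pieces are concatenated.

```lua
-- Hypothetical stand-in for mw.text.trim
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Escape Lua pattern magic characters in a literal string
local function escapeString( str )
	return ( str:gsub( '[%^%$%(%)%.%[%]%*%+%-%?%%]', '%%%0' ) )
end

-- Concatenate the contents of every <section begin=NAME /> ... <section end=NAME /> pair
local function getSectionTag( wikitext, name )
	name = escapeString( trim( name ) )
	local open = '< *section +begin *= *["\']? *' .. name .. ' *["\']? */>'
	local close = '< *section +end *= *["\']? *' .. name .. ' *["\']? */>'
	local parts = {}
	for content in wikitext:gmatch( open .. '(.-)' .. close ) do
		parts[ #parts + 1 ] = content
	end
	if #parts > 0 then
		return table.concat( parts )
	end
end

local wikitext = 'A <section begin="intro" />Hello<section end="intro" /> B '
	.. '<section begin=intro />, world<section end=intro /> C'
local intro = getSectionTag( wikitext, 'intro' )
local other = getSectionTag( wikitext, 'other' )

print( intro ) -- Hello, world
print( other ) -- nil
```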
getLists
getLists( wikitext )
Returns a table with each value being a list (ordered or unordered).
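A minimal standalone sketch of the list matching, using the same pattern as the module source on made-up input. A list here is a run of consecutive lines starting with `*` or `#`.

```lua
-- Collect runs of consecutive lines that start with * or #
local function getLists( wikitext )
	local lists = {}
	wikitext = '\n' .. wikitext .. '\n\n'
	for list in wikitext:gmatch( '\n([*#].-)\n[^*#]' ) do
		lists[ #lists + 1 ] = list
	end
	return lists
end

local lists = getLists( 'Intro\n* a\n* b\nText\n# one\n# two' )

print( #lists ) -- 2
```

In this example the first value is the two-line unordered list and the second is the two-line ordered list, each returned as a single string with embedded newlines.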
getParagraphs
getParagraphs( wikitext )
Returns a table with each value being a paragraph. Paragraphs are defined as block-level elements that are not lists, templates, files, categories, tables or section titles.
getTemplates
getTemplates( wikitext )
Returns a table with each value being a template.
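getTemplate
getTemplate( wikitext, templateName )
Returns the template with the given template name. If the template appears more than once, only the first instance will be returned. If the template is not found, nil will be returned.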
getTemplateName
getTemplateName( templateWikitext )
Returns the name of the given template. If the given wikitext is not recognized as that of a template, nil will be returned.
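The template methods can be explored with the standalone sketch below, which copies the two patterns from the module source into local functions. The sample wikitext is an assumption made for illustration. Nested templates are returned as part of their parent, and parser functions such as `#if` are skipped.

```lua
-- Collect top-level templates, skipping parser functions like {{#if:...}}
local function getTemplates( wikitext )
	local templates = {}
	for template in wikitext:gmatch( '{%b{}}' ) do
		if template:sub( 1, 3 ) ~= '{{#' then
			templates[ #templates + 1 ] = template
		end
	end
	return templates
end

-- Read the name: everything after "{{" up to the first "|", "}" or newline
local function getTemplateName( templateWikitext )
	return templateWikitext:match( '^{{ *([^}|\n]+)' )
end

local templates = getTemplates( '{{Infobox person|name={{PAGENAME}}}} Text {{#if:x|y}} {{cite web|url=a}}' )
local firstName = getTemplateName( templates[ 1 ] )
local secondName = getTemplateName( templates[ 2 ] )
local notATemplate = getTemplateName( 'plain text' )

print( #templates ) -- 2
print( firstName )  -- Infobox person
print( secondName ) -- cite web
```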
getTemplateParameters
getTemplateParameters( templateWikitext )
Returns a table with the parameter names as keys and the parameter values as values. For unnamed parameters, the keys are numerical. If the given wikitext is not recognized as that of a template, nil will be returned.
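To illustrate how named and unnamed parameters end up as keys, here is a deliberately simplified standalone sketch. `getFlatParameters` is a hypothetical helper, not part of the module, and it assumes the template contains no nested templates or links (the module handles those by temporarily masking their pipes and equals signs). `trim` stands in for `mw.text.trim`.

```lua
-- Hypothetical stand-in for mw.text.trim
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Simplified parser: assumes no nested templates or links inside the parameters
local function getFlatParameters( templateWikitext )
	local parameters, order = {}, {}
	local params = templateWikitext:match( '{{[^|}]-|(.*)}}' )
	if not params then
		return parameters, order
	end
	local count = 0
	for param in ( params .. '|' ):gmatch( '(.-)|' ) do
		local name, value = param:match( '^(.-)=(.*)$' )
		if name then
			-- Named parameter: numeric names become numeric keys
			name = trim( name )
			name = tonumber( name ) or name
			value = trim( value )
		else
			-- Unnamed parameter: numbered by position
			count = count + 1
			name, value = count, trim( param )
		end
		parameters[ name ] = value
		order[ #order + 1 ] = name
	end
	return parameters, order
end

local parameters, order = getFlatParameters( '{{Infobox|Alice|born = 1990|2 = Bob}}' )

print( parameters[ 1 ] )  -- Alice
print( parameters.born )  -- 1990
print( parameters[ 2 ] )  -- Bob
```

The second return value records the order in which the names were parsed, since the map itself does not preserve it.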
getTags
getTags( wikitext )
Returns a table with each value being a tag and its contents (like <div>, <gallery>, <ref>, <noinclude>). Tags inside tags will be ignored. If you're interested in getting them, run this method again for each of the returned tags.
getTagName
getTagName( tagWikitext )
Returns the name of the tag in the given wikitext, in lowercase. For example 'div', 'span', 'gallery' or 'ref'.
getTagAttribute
getTagAttribute( tagWikitext, attribute )
Returns the value of an attribute in the given tag. For example the id of a div or the name of a reference.
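The attribute lookup can be tried with this standalone sketch, which reuses the pattern from the module source on made-up tags. It handles double-quoted, single-quoted and unquoted values; the back-reference `%1` makes the closing quote match the opening one.

```lua
-- Read one attribute from the opening tag of the given tag wikitext
local function getTagAttribute( tagWikitext, attribute )
	local _, value = tagWikitext:match( '^<[^/>]*' .. attribute .. ' *= *(["\']?)([^/>]-)%1[ />]' )
	return value
end

local refName = getTagAttribute( '<ref name="smith2020" />', 'name' )
local divId = getTagAttribute( "<div id='main' class=box>Text</div>", 'id' )
local divClass = getTagAttribute( "<div id='main' class=box>Text</div>", 'class' )
local missing = getTagAttribute( '<br>', 'id' )

print( refName )  -- smith2020
print( divId )    -- main
print( divClass ) -- box
print( missing )  -- nil
```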
getGalleries
getGalleries( wikitext )
Returns a table with each value being a gallery.
getReferences
getReferences( wikitext )
Returns a table with each value being a reference. This includes self-closing references (like <ref name="foo" />) as well as full references.
getTables
getTables( wikitext )
Returns a table with each value being a wiki table.
getTableAttribute
getTableAttribute( tableWikitext, attribute )
Returns the value of an attribute in the given wiki table. For example the id or the class.
getTable
getTable( wikitext, id )
Returns the wiki table with the given id. If not found, nil will be returned.
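The table lookup methods can be sketched in plain Lua as follows. The functions copy the patterns from the module source; `trim` is a hypothetical stand-in for `mw.text.trim` and the sample wikitext is made up. Only tables that start on their own line are found.

```lua
-- Hypothetical stand-in for mw.text.trim
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Collect wiki tables, i.e. balanced {...} blocks that begin a line with "{|"
local function getTables( wikitext )
	local tables = {}
	wikitext = '\n' .. wikitext
	for t in wikitext:gmatch( '\n%b{}' ) do
		if t:sub( 1, 3 ) == '\n{|' then
			tables[ #tables + 1 ] = trim( t )
		end
	end
	return tables
end

-- Read an attribute from the first line of the table, quoted or unquoted
local function getTableAttribute( tableWikitext, attribute )
	local _, value = tableWikitext:match( '^{|[^\n]*' .. attribute .. ' *= *(["\']?)([^\n]-)%1[^\n]*\n' )
	if not value or value == '' then
		value = tableWikitext:match( '^{|[^\n]*' .. attribute .. ' *= *([^\n ]+)[^\n]*\n' )
	end
	return value
end

local tables = getTables( 'Intro\n{| id="scores" class=wikitable\n| 1\n|}\nOutro {{tpl}}' )
local tableId = getTableAttribute( tables[ 1 ], 'id' )
local tableClass = getTableAttribute( tables[ 1 ], 'class' )

print( #tables )    -- 1
print( tableId )    -- scores
print( tableClass ) -- wikitable
```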
getTableData
getTableData( tableWikitext )
Returns a Lua table representing the data of the given wiki table: a sequence of rows, where each row is a sequence of cell contents.
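The shape of the returned data is easiest to see on a small example. The standalone sketch below follows the same steps as the module source, with `trim` and `split` as hypothetical stand-ins for `mw.text.trim` and `mw.text.gsplit`; the sample table is made up. Header cells and data cells are treated alike.

```lua
-- Hypothetical stand-in for mw.text.trim
local function trim( s )
	return ( s:match( '^%s*(.-)%s*$' ) )
end

-- Plain-text splitter, a hypothetical stand-in for mw.text.gsplit
local function split( text, sep )
	local parts, start = {}, 1
	while true do
		local first, last = text:find( sep, start, true )
		if not first then
			parts[ #parts + 1 ] = text:sub( start )
			return parts
		end
		parts[ #parts + 1 ] = text:sub( start, first - 1 )
		start = last + 1
	end
end

local function getTableData( tableWikitext )
	local data = {}
	local text = trim( tableWikitext )
	text = text:gsub( '^{|.-\n', '' )      -- remove the opening line
	text = text:gsub( '\n|}$', '' )        -- remove the closing line
	text = text:gsub( '^|%+.-\n', '' )     -- remove any caption
	text = text:gsub( '|%-.-\n', '|-\n' )  -- remove any row attributes
	text = text:gsub( '^|%-\n', '' )       -- remove any leading empty row
	text = text:gsub( '\n|%-$', '' )       -- remove any trailing empty row
	for _, row in ipairs( split( text, '|-' ) ) do
		local cells = {}
		-- Normalize every cell separator to a newline followed by a pipe
		row = row:gsub( '||', '\n|' ):gsub( '!!', '\n|' ):gsub( '\n!', '\n|' )
		row = row:gsub( '^!', '\n|' ):gsub( '^\n|', '' )
		for _, cell in ipairs( split( row, '\n|' ) ) do
			cells[ #cells + 1 ] = trim( cell )
		end
		data[ #data + 1 ] = cells
	end
	return data
end

local data = getTableData( '{| class="wikitable"\n|+ Caption\n|-\n! Name !! Age\n|-\n| Alice || 30\n|-\n| Bob\n| 25\n|}' )

print( #data )         -- 3
print( data[ 1 ][ 1 ] ) -- Name
print( data[ 2 ][ 2 ] ) -- 30
```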
getLinks
getLinks( wikitext )
Returns a Lua table with each value being a wiki link, including file and category links. For external links, use getExternalLinks instead.
getFiles
getFiles( wikitext )
Returns a Lua table with each value being a file link.
getFileName
getFileName( fileWikitext )
Returns the name of the given file. If the given wikitext is not recognized as that of a file, nil will be returned.
getCategories
getCategories( wikitext )
Returns a Lua table with each value being a category link.
getExternalLinks
getExternalLinks( wikitext )
Returns a Lua table with each value being an external link. For internal links, use getLinks instead.
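The difference between internal and external links comes down to double versus single brackets. The standalone sketch below copies both patterns from the module source and runs them on made-up wikitext (the URL is a placeholder).

```lua
-- Internal links: balanced [[...]] blocks, including file and category links
local function getLinks( wikitext )
	local links = {}
	for link in wikitext:gmatch( '%[%b[]%]' ) do
		links[ #links + 1 ] = link
	end
	return links
end

-- External links: balanced [...] blocks that start with a URL
local function getExternalLinks( wikitext )
	local links = {}
	for link in wikitext:gmatch( '%b[]' ) do
		if link:match( '^%[//' ) or link:match( '^%[https?://' ) then
			links[ #links + 1 ] = link
		end
	end
	return links
end

local wikitext = 'See [[Lua]] and [[File:Logo.png|thumb|The [[logo]]]] or [https://example.org Example].'
local internal = getLinks( wikitext )
local external = getExternalLinks( wikitext )

print( #internal ) -- 2
print( #external ) -- 1
```

Note that the link nested inside the file caption is returned as part of the file link rather than as a separate value.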
See also
- Module:Excerpt - Main caller of this module
- mw:WikitextParser.js - Similar parser written in JavaScript, for use in gadgets, user scripts and other tools
-- Module:WikitextParser is a general-purpose wikitext parser
-- Documentation and master version: https://en.wikipedia.org/wiki/Module:WikitextParser
-- Authors: User:Sophivorus, User:Certes, User:Aidan9382, et al.
-- License: CC-BY-SA-4.0
local parser = {}
-- Private helper method to escape a string for use in Lua patterns
local function escapeString( str )
	-- The parentheses discard the second return value of gsub (the match count)
	return ( string.gsub( str, '[%^%$%(%)%.%[%]%*%+%-%?%%]', '%%%0' ) )
end
-- Private helper method to strip wikitext from section titles
-- Copied from [[Module:Plain text]]
local function plainText( str )
	return mw.text.killMarkers( str )
		:gsub( '&nbsp;', ' ' ) -- replace nbsp spaces with regular spaces
		:gsub( '<br ?/?>', ', ' ) -- replace br with commas
		:gsub( '<span.->(.-)</span>', '%1' ) -- remove spans while keeping text inside
		:gsub( '<i.->(.-)</i>', '%1' ) -- remove italics while keeping text inside
		:gsub( '<b.->(.-)</b>', '%1' ) -- remove bold while keeping text inside
		:gsub( '<em.->(.-)</em>', '%1' ) -- remove emphasis while keeping text inside
		:gsub( '<strong.->(.-)</strong>', '%1' ) -- remove strong while keeping text inside
		:gsub( '<sub.->(.-)</sub>', '%1' ) -- remove subscript markup; retain contents
		:gsub( '<sup.->(.-)</sup>', '%1' ) -- remove superscript markup; retain contents
		:gsub( '<u.->(.-)</u>', '%1' ) -- remove underline markup; retain contents
		:gsub( '<.->.-<.->', '' ) -- strip out remaining tags and the text inside
		:gsub( '<.->', '' ) -- remove any other tag markup
		:gsub( '%[%[%s*[Ff][Ii][Ll][Ee]%s*:.-%]%]', '' ) -- strip out files
		:gsub( '%[%[%s*[Ii][Mm][Aa][Gg][Ee]%s*:.-%]%]', '' ) -- strip out use of image:
		:gsub( '%[%[%s*[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy]%s*:.-%]%]', '' ) -- strip out categories
		:gsub( '%[%[[^%]]-|', '' ) -- strip out piped link text
		:gsub( '([^%[])%[[^%[%]][^%]]-%s', '%1' ) -- strip out external link text
		:gsub( '^%[[^%[%]][^%]]-%s', '' ) -- strip out external link text
		:gsub( '[%[%]]', '' ) -- then strip out remaining [ and ]
		:gsub( "'''''", '' ) -- strip out bold italic markup
		:gsub( "'''?", '' ) -- not stripping out '''' gives correct output for bolded text in quotes
		:gsub( '%-%-%-%-+', '' ) -- remove ---- lines
		:gsub( '^%s+', '' ) -- strip leading
		:gsub( '%s+$', '' ) -- and trailing spaces
		:gsub( '%s+', ' ' ) -- strip redundant spaces
end
-- Get the lead section from the given wikitext
-- The lead section is any content before the first section title.
-- @param wikitext Required. Wikitext to parse.
-- @return Wikitext of the lead section. May be empty if the lead section is empty.
function parser.getLead( wikitext )
	wikitext = '\n' .. wikitext
	wikitext = string.gsub( wikitext, '\n==.*', '' )
	wikitext = mw.text.trim( wikitext )
	return wikitext
end
-- Get the sections from the given wikitext
-- This method doesn't get the lead section, use getLead for that
-- @param wikitext Required. Wikitext to parse.
-- @return Map from section title to section content
function parser.getSections( wikitext )
	local sections = {}
	wikitext = '\n' .. wikitext .. '\n=='
	for title in string.gmatch( wikitext, '\n==+ *([^=]-) *==+' ) do
		local section = string.match( wikitext, '\n==+ *' .. escapeString( title ) .. ' *==+(.-)\n==' )
		section = mw.text.trim( section )
		sections[ title ] = section
	end
	return sections
end
-- Get a section from the given wikitext (without subsections)
-- If the given section title appears more than once, only the section of the first instance will be returned
-- @param wikitext Required. Wikitext to parse.
-- @param title Required. Title of the section
-- @return Wikitext of the section, or nil if it isn't found. May be empty if the section is empty.
function parser.getSection( wikitext, title )
	title = mw.text.trim( title )
	local plainTitle = plainText( title )
	local sections = parser.getSections( wikitext )
	for sectionTitle, sectionContent in pairs( sections ) do
		if sectionTitle == title or sectionTitle == plainTitle then
			return sectionContent
		end
	end
end
-- Get the content of a <section> tag from the given wikitext.
-- We can't use getTags because unlike all other tags, both opening and closing <section> tags are self-closing.
-- @param wikitext Required. Wikitext to parse.
-- @param name Required. Name of the <section> tag
-- @return Content of the <section> tag, or nil if it isn't found. May be empty if the section tag is empty.
function parser.getSectionTag( wikitext, name )
	name = mw.text.trim( name )
	name = escapeString( name )
	local sections = {}
	-- Allow optional spaces around "=" in both the begin and the end tag
	for section in string.gmatch( wikitext, '< *section +begin *= *["\']? *' .. name .. ' *["\']? */>(.-)< *section +end *= *["\']? *' .. name .. ' *["\']? */>' ) do
		table.insert( sections, section )
	end
	if #sections > 0 then
		return table.concat( sections )
	end
end
-- Get the lists from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of lists.
function parser.getLists( wikitext )
	local lists = {}
	wikitext = '\n' .. wikitext .. '\n\n'
	for list in string.gmatch( wikitext, '\n([*#].-)\n[^*#]' ) do
		table.insert( lists, list )
	end
	return lists
end
-- Get the paragraphs from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of paragraphs.
function parser.getParagraphs( wikitext )
	local paragraphs = {}
	-- Remove non-paragraphs
	wikitext = '\n' .. wikitext .. '\n' -- add newlines to simplify patterns
	wikitext = string.gsub( wikitext, '%f[^\n]<!%-%-.-%-%->%f[\n]', '' ) -- remove comments
	wikitext = string.gsub( wikitext, '%f[^\n]%[%b[]%]%f[\n]', '' ) -- remove files and categories
	wikitext = string.gsub( wikitext, '%f[^\n]%b{} *%f[\n]', '' ) -- remove tables and block templates
	wikitext = string.gsub( wikitext, '%f[^\n]%b{} *%b{} *%f[\n]', '' ) -- remove neighboring tables and block templates
	wikitext = string.gsub( wikitext, '%f[^\n]%b{} *<!%-%-.-%-%-> *%b{} *%f[\n]', '' ) -- remove neighboring tables and block templates with a comment between them
	wikitext = string.gsub( wikitext, '%f[^\n][*#].-%f[\n]', '' ) -- remove lists
	wikitext = string.gsub( wikitext, '%f[^\n]==+[^=]+==+ *%f[\n]', '' ) -- remove section titles
	wikitext = mw.text.trim( wikitext )
	for paragraph in mw.text.gsplit( wikitext, '\n\n+' ) do
		if mw.text.trim( paragraph ) ~= '' then
			table.insert( paragraphs, paragraph )
		end
	end
	return paragraphs
end
-- Get the templates from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of templates.
function parser.getTemplates( wikitext )
	local templates = {}
	for template in string.gmatch( wikitext, '{%b{}}' ) do
		if string.sub( template, 1, 3 ) ~= '{{#' then -- skip parser functions like #if
			table.insert( templates, template )
		end
	end
	return templates
end
-- Get the requested template from the given wikitext.
-- If the template appears more than once, only the first instance will be returned
-- @param wikitext Required. Wikitext to parse.
-- @param name Name of the template to get
-- @return Wikitext of the template, or nil if it wasn't found
function parser.getTemplate( wikitext, name )
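	local templates = parser.getTemplates( wikitext )
	local lang = mw.language.getContentLanguage()
	-- ipairs guarantees that the first matching instance is the one returned
	for _, template in ipairs( templates ) do
		local templateName = parser.getTemplateName( template )
		-- Guard against wikitext like {{|x}} where no name can be read
		if templateName and lang:ucfirst( templateName ) == lang:ucfirst( name ) then
			return template
		end
	end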
end
-- Get name of the template from the given template wikitext.
-- @param templateWikitext Required. Wikitext of the template to parse.
-- @return Name of the template
-- @todo Strip "Template:" namespace?
function parser.getTemplateName( templateWikitext )
	return string.match( templateWikitext, '^{{ *([^}|\n]+)' )
end
-- Get the parameters from the given template wikitext.
-- @param templateWikitext Required. Wikitext of the template to parse.
-- @return Map from parameter names to parameter values, NOT IN THE ORIGINAL ORDER.
-- @return Order in which the parameters were parsed.
function parser.getTemplateParameters( templateWikitext )
	local parameters = {}
	local paramOrder = {}
	local params = string.match( templateWikitext, '{{[^|}]-|(.*)}}' )
	if params then
		-- Temporarily replace pipes and equals signs in subtemplates and links to avoid chaos
		-- The extra parentheses keep only the first return value of the inner gsub,
		-- so its match count isn't passed on as the replacement limit of the outer gsub
		for subtemplate in string.gmatch( params, '{%b{}}' ) do
			params = string.gsub( params, escapeString( subtemplate ), ( string.gsub( subtemplate, '.', { ['%']='%%', ['|']='@@:@@', ['=']='@@_@@' } ) ) )
		end
		for link in string.gmatch( params, '%[%b[]%]' ) do
			params = string.gsub( params, escapeString( link ), ( string.gsub( link, '.', { ['%']='%%', ['|']='@@:@@', ['=']='@@_@@' } ) ) )
		end
		local count = 0
		local parts, name, value
		for param in mw.text.gsplit( params, '|' ) do
			parts = mw.text.split( param, '=' )
			name = mw.text.trim( parts[1] )
			if tonumber( name ) then
				name = tonumber( name )
			end
			if #parts == 1 then
				-- Unnamed parameter: the only part is the value, keyed by position
				value = name
				count = count + 1
				name = count
			else
				value = table.concat( parts, '=', 2 )
				value = mw.text.trim( value )
			end
			value = string.gsub( value, '@@_@@', '=' )
			value = string.gsub( value, '@@:@@', '|' )
			parameters[ name ] = value
			table.insert( paramOrder, name )
		end
	end
	return parameters, paramOrder
end
-- Get the tags from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of tags.
function parser.getTags( wikitext )
	local tags = {}
	local tag, tagName, tagEnd
	-- Don't match closing tags like </div>, comments like <!--foo-->, comparisons like 1<2 or things like <3
	for tagStart, tagOpen in string.gmatch( wikitext, '()(<[^/!%d].->)' ) do
		tagName = parser.getTagName( tagOpen )
		-- If we're in a self-closing tag, like <ref name="foo" />, <references/>, <br/>, <br>, <hr>, etc.
		if string.match( tagOpen, '<.-/>' ) or tagName == 'br' or tagName == 'hr' then
			tag = tagOpen
		-- If we're in a tag that may contain others like it, like <div> or <span>
		elseif tagName == 'div' or tagName == 'span' then
			local position = tagStart + #tagOpen - 1
			local depth = 1
			while depth > 0 do
				tagEnd = string.match( wikitext, '</ ?' .. tagName .. ' ?>()', position )
				if tagEnd then
					tagEnd = tagEnd - 1
				else
					break -- unclosed tag
				end
				position = string.match( wikitext, '()< ?' .. tagName .. '[ >]', position + 1 )
				if not position then
					position = tagEnd + 1
				end
				if position > tagEnd then
					depth = depth - 1
				else
					depth = depth + 1
				end
			end
			tag = string.sub( wikitext, tagStart, tagEnd )
		-- Else we're probably in a tag that shouldn't contain others like it, like <math> or <strong>
		else
			tagEnd = string.match( wikitext, '</ ?' .. tagName .. ' ?>()', tagStart )
			if tagEnd then
				tag = string.sub( wikitext, tagStart, tagEnd - 1 )
			-- If no end tag is found, assume we matched something that wasn't a tag, like <no. 1>
			else
				tag = nil
			end
		end
		table.insert( tags, tag )
	end
	return tags
end
-- Get the name of the tag in the given wikitext
-- @param tagWikitext Required. Wikitext of the tag to parse.
-- @return Name of the tag in lowercase, or nil if not found
function parser.getTagName( tagWikitext )
	local tagName = string.match( tagWikitext, '^< *(.-)[ />]' )
	if tagName then tagName = string.lower( tagName ) end
	return tagName
end
-- Get the value of an attribute in the given tag.
-- @param tagWikitext Required. Wikitext of the tag to parse.
-- @param attribute Required. Name of the attribute.
-- @return Value of the attribute or nil if not found
function parser.getTagAttribute( tagWikitext, attribute )
	local _, value = string.match( tagWikitext, '^<[^/>]*' .. attribute .. ' *= *(["\']?)([^/>]-)%1[ />]' )
	return value
end
-- Get the content of the given tag.
-- @param tagWikitext Required. Wikitext of the tag to parse.
-- @return Content of the tag. May be empty if the tag is empty. Will be nil if the tag is self-closing.
-- @todo May fail with nested tags
function parser.getTagContent( tagWikitext )
	-- Capture only what lies between the opening tag and the first closing tag
	return string.match( tagWikitext, '^<.->(.-)</.->' )
end
-- Get the <gallery> tags from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of gallery tags.
function parser.getGalleries( wikitext )
	local galleries = {}
	local tags = parser.getTags( wikitext )
	-- ipairs keeps the galleries in the order they appear in the wikitext
	for _, tag in ipairs( tags ) do
		local tagName = parser.getTagName( tag )
		if tagName == 'gallery' then
			table.insert( galleries, tag )
		end
	end
	return galleries
end
-- Get the <ref> tags from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of ref tags.
function parser.getReferences( wikitext )
	local references = {}
	local tags = parser.getTags( wikitext )
	-- ipairs keeps the references in the order they appear in the wikitext
	for _, tag in ipairs( tags ) do
		local tagName = parser.getTagName( tag )
		if tagName == 'ref' then
			table.insert( references, tag )
		end
	end
	return references
end
-- Get the reference with the given name from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @param referenceName Required. Name of the reference.
-- @return Wikitext of the first full (not self-closing) reference with that name, or nil if not found
function parser.getReference( wikitext, referenceName )
	local references = parser.getReferences( wikitext )
	for _, reference in ipairs( references ) do
		local content = parser.getTagContent( reference )
		local name = parser.getTagAttribute( reference, 'name' )
		if content and name == referenceName then
			return reference
		end
	end
end
-- Get the tables from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of tables.
function parser.getTables( wikitext )
	local tables = {}
	wikitext = '\n' .. wikitext
	for t in string.gmatch( wikitext, '\n%b{}' ) do
		if string.sub( t, 1, 3 ) == '\n{|' then
			t = mw.text.trim( t ) -- exclude the leading newline
			table.insert( tables, t )
		end
	end
	return tables
end
-- Get the value of an attribute (such as the id or the class) from the given table wikitext
-- @param tableWikitext Required. Wikitext of the table to parse.
-- @param attribute Required. Name of the attribute.
-- @return Value of the attribute or nil if not found
function parser.getTableAttribute( tableWikitext, attribute )
	local _, value = string.match( tableWikitext, '^{|[^\n]*' .. attribute .. ' *= *(["\']?)([^\n]-)%1[^\n]*\n' )
	-- Unquoted values come back empty from the pattern above, so retry up to the next space
	if not value or value == '' then
		value = string.match( tableWikitext, '^{|[^\n]*' .. attribute .. ' *= *([^\n ]+)[^\n]*\n' )
	end
	return value
end
-- Get a table by id from the given wikitext
-- @param wikitext Required. Wikitext to parse.
-- @param id Required. Id of the table
-- @return Wikitext of the table or nil if not found
function parser.getTable( wikitext, id )
	local tables = parser.getTables( wikitext )
	-- ipairs guarantees that the first table with the given id is the one returned
	for _, t in ipairs( tables ) do
		if id == parser.getTableAttribute( t, 'id' ) then
			return t
		end
	end
end
-- Get the data from the given table wikitext
-- @param tableWikitext Required. Wikitext of the table to parse.
-- @return Sequence of rows, each row being a sequence of cell contents
-- @todo Test and make more robust
function parser.getTableData( tableWikitext )
	local tableData = {}
	tableWikitext = mw.text.trim( tableWikitext )
	tableWikitext = string.gsub( tableWikitext, '^{|.-\n', '' ) -- remove the opening line
	tableWikitext = string.gsub( tableWikitext, '\n|}$', '' ) -- remove the closing line
	tableWikitext = string.gsub( tableWikitext, '^|%+.-\n', '' ) -- remove any caption
	tableWikitext = string.gsub( tableWikitext, '|%-.-\n', '|-\n' ) -- remove any row attributes
	tableWikitext = string.gsub( tableWikitext, '^|%-\n', '' ) -- remove any leading empty row
	tableWikitext = string.gsub( tableWikitext, '\n|%-$', '' ) -- remove any trailing empty row
	for rowWikitext in mw.text.gsplit( tableWikitext, '|-', true ) do
		local rowData = {}
		-- Normalize every cell separator (||, !!, leading !) to a newline followed by a pipe
		rowWikitext = string.gsub( rowWikitext, '||', '\n|' )
		rowWikitext = string.gsub( rowWikitext, '!!', '\n|' )
		rowWikitext = string.gsub( rowWikitext, '\n!', '\n|' )
		rowWikitext = string.gsub( rowWikitext, '^!', '\n|' )
		rowWikitext = string.gsub( rowWikitext, '^\n|', '' )
		for cellWikitext in mw.text.gsplit( rowWikitext, '\n|' ) do
			cellWikitext = mw.text.trim( cellWikitext )
			table.insert( rowData, cellWikitext )
		end
		table.insert( tableData, rowData )
	end
	return tableData
end
-- Get the internal links from the given wikitext (includes category and file links).
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of internal links.
function parser.getLinks( wikitext )
	local links = {}
	for link in string.gmatch( wikitext, '%[%b[]%]' ) do
		table.insert( links, link )
	end
	return links
end
-- Get the file links from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of file links.
function parser.getFiles( wikitext )
	local files = {}
	local links = parser.getLinks( wikitext )
	-- ipairs keeps the files in the order they appear in the wikitext
	for _, link in ipairs( links ) do
		local namespace = string.match( link, '^%[%[ *(.-) *:' )
		if namespace and mw.site.namespaces[ namespace ] and mw.site.namespaces[ namespace ].canonicalName == 'File' then
			table.insert( files, link )
		end
	end
	return files
end
-- Get name of the file from the given file wikitext.
-- @param fileWikitext Required. Wikitext of the file to parse.
-- @return Name of the file
function parser.getFileName( fileWikitext )
	return string.match( fileWikitext, '^%[%[ *.- *: *(.-) *[]|]' )
end
-- Get the category links from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of category links.
function parser.getCategories( wikitext )
	local categories = {}
	local links = parser.getLinks( wikitext )
	for _, link in ipairs( links ) do
		-- Use greedy ' *' (as in getFiles) so that leading spaces are not captured in the namespace
		local namespace = string.match( link, '^%[%[ *(.-) *:' )
		if namespace and mw.site.namespaces[ namespace ] and mw.site.namespaces[ namespace ].canonicalName == 'Category' then
			table.insert( categories, link )
		end
	end
	return categories
end
-- Get the external links from the given wikitext.
-- @param wikitext Required. Wikitext to parse.
-- @return Sequence of external links.
function parser.getExternalLinks( wikitext )
	local links = {}
	for link in string.gmatch( wikitext, '%b[]' ) do
		if string.match( link, '^%[//' ) or string.match( link, '^%[https?://' ) then
			table.insert( links, link )
		end
	end
	return links
end
return parser