Benutzer:Dirk Huenniger/wb2pdf/details

This page is for those interested in working on MediaWiki to LaTeX development, specifically the details of the pdf-generator system.

Important links

The source code is available at https://sourceforge.net/p/wb2pdf/git/

Bugs can be reported at Benutzer Diskussion:Dirk Huenniger/wb2pdf/Requests

Overview

MediaWiki to LaTeX consists of two parts and is written entirely in Haskell.

I am still working on the project on a Debian operating system and use LaTeX to create the PDF files. ImageMagick is also required, in particular its convert program.

All files are stored in /trunk/src if not noted otherwise.

The compiler is written in Haskell. A part that is modified particularly often is the templateProcessor function, which can be found in the file LatexRenderer.hs. It defines how to render MediaWiki templates to LaTeX. It is important to know that Haskell belongs to the programming languages that follow the so-called functional programming paradigm. It is a purely functional programming language and therefore totally different from object oriented programming languages, but at least as complex as these. So be aware: functional programming does not automatically imply that it is functioning, but it is based on functions.

The Parts of the Compiler

The Haskell program source consists mainly of a compiler that translates an input file, written in the MediaWiki markup language, into the LaTeX typesetting language; the result is from now on called the output file.
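For example, a heading written in MediaWiki markup, such as == My Heading ==, would end up in the output file as a LaTeX sectioning command, for instance \section{My Heading} (a simplified, hand-made illustration rather than literal output of the program).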

The Parser

At first the input file is parsed to create a parse tree. The parser used here is based on the Parsec parser combinator library. We (the authors, not the translator nor the pizza deliverer) have added a parsing system on top of Parsec. This is useful because there are different environments in MediaWiki's markup language; the same strings can have different meanings depending on the environment they are used in. For each environment we have one parser for the beginning and one for the end of the environment in question. We assume that the environment begins as soon as the start parser matches and ends when the end parser matches. To see all environments you may take a closer look at MediaWikiParseTree.hs, particularly at the definition of the data structure "Anything". An excerpt of it follows.

data Anything a
    = Wikilink StartData [Anything a] 
    | ImageMap StartData [Anything a] 
    | Wikitable StartData [Anything a] 
    | Wikiheading StartData [Anything a] 
    | Itemgroup StartData [Anything a] 
    | C a 

The individual environments shown above are Wikilink, ImageMap, Wikitable, Wikiheading and Itemgroup. You will notice that StartData and [Anything a] are added to each of them. This has a specific meaning. Environments can obviously be nested. [Anything a] stands for a list of nodes, so each environment contains a list of sub-environments. A single character is also allowed to be part of this list; this is achieved by "C a" in the definition of "Anything a". Other elements can also be part of this list and are declared in the full definition of "Anything a" found in MediaWikiParseTree.hs. So "Anything a" is a single node that contains a list of sub-nodes. StartData means that some information associated with the beginning of an environment is stored in the parse tree.
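To get a feeling for the nesting, here is a small self-contained sketch. It mirrors the structure of "Anything a" but is not the project's own definition: StartData is replaced by a plain String and only two constructors are kept.

-- Simplified stand-in for the parse tree (hypothetical, for illustration only):
-- an environment carries its start information and a list of sub-nodes,
-- and a single character can appear anywhere in such a list.
data Node a
    = Env String [Node a]
    | C a

-- A wikilink nested inside an italic environment, as in ''[[Foo]]'',
-- could roughly become:
example :: Node Char
example = Env "Italic" [Env "Wikilink" (map C "Foo")]

main :: IO ()
main = print (countChars example)   -- prints 3
  where
    countChars (C _)      = 1 :: Int
    countChars (Env _ xs) = sum (map countChars xs)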

To see individual parsers for particular environments you may look at nowikip and noincludep in MediaWikiParser.hs. Their definitions are shown below.

nowikip = baseParser {
    start = string "<nowiki>" >> return (Str ""),
    end = \x -> string "</nowiki>" >> return (),
    allowed = everywhere ++ wikilinkwhere,
    self = NoWiki
}


noincludep = baseParser {
    start = string "<noinclude>" >> return (Str ""),
    end = \x -> string "</noinclude>" >> return (),
    self = NoInclude
}

Each of the two parsers above contains a parser for the beginning of the environment, called start. Analogously, there is a parser called end, which succeeds if it matches the end of the environment. Each of these parsers matches a single string. self stands for the name of the environment node in the parse tree; so NoInclude and NoWiki are part of the definition of "Anything a" in MediaWikiParseTree.hs. Some environments only make sense within certain surrounding environments: an environment may only occur as a sub-environment of one of the surrounding environments listed in allowed. You may want to know that end is essentially a lambda function; the parameter passed to it is the result returned by the start parser. Therefore the matching of an end parser may depend on the details of the outcome of its associated start parser.
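To make the start/end idea concrete, here is a minimal stand-alone sketch built directly on Parsec. It does not use the project's MyParser machinery; the names Piece, NoWikiEnv and anyPiece are made up for this illustration.

import Text.Parsec
import Text.Parsec.String (Parser)

-- A toy node type: plain characters and one kind of environment.
data Piece = PlainChar Char | NoWikiEnv [Piece] deriving Show

-- The environment begins as soon as the start parser ("<nowiki>") matches
-- and ends when the end parser ("</nowiki>") matches.
nowikiEnv :: Parser Piece
nowikiEnv = do
    _  <- try (string "<nowiki>")
    xs <- manyTill anyPiece (try (string "</nowiki>"))
    return (NoWikiEnv xs)

anyPiece :: Parser Piece
anyPiece = try nowikiEnv <|> (PlainChar <$> anyChar)

main :: IO ()
main = parseTest (many anyPiece) "a<nowiki>b</nowiki>c"
-- prints [PlainChar 'a',NoWikiEnv [PlainChar 'b'],PlainChar 'c']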

All parsers have to be added to the parsers list in order to become active. The list of parsers can be found in MediaWikiParser.hs and looks as follows:

parsers=[supp,subp,dhunurlp,itagparser,chapterp,prep,...]
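If you write a parser of your own, say mycustomp (a made-up name used only for this illustration), it becomes active once it appears in that list:

parsers=[supp,subp,dhunurlp,itagparser,chapterp,prep,...,mycustomp]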

The default parser is defined as follows:

baseParser = MyParser {
    bad = pzero,              -- invalidates the environment if it matches; pzero never matches
    start = undefined,        -- parser for the beginning of the environment
    end = \x -> return (),    -- parser for the end; x is the result of the start parser
    allowed = everywhere,     -- surrounding environments this environment may occur in
    self = undefined,         -- name of the environment node in the parse tree
    modify = \s x -> x        -- applied to the list of sub-nodes before the node is added
}

Two functions are still unexplained. The first one is bad, called so because it is particularly tricky. If bad matches inside an environment, the whole environment is considered to be invalid. In this case the parser tries every other possible interpretation, rejecting the option of the original environment. The remaining function is modify. It is applied to the list of sub-nodes before the node is added to the parse tree. This enables you to modify the tree during the parsing process. By the way, pzero, which is used in the default definition of the bad parser, is a parser that never matches. The whole work concerning the environments is done in a few lines by three functions called anything, app and app2. I hope you appreciate my work and also this explanation, even though both may appear slightly complicated.

The Renderer

The parse tree is rendered into a LaTeX document by the code written in LatexRenderer.hs. The parse tree is essentially a list of nodes, where each node may contain a list of sub-nodes and so on. It is converted to the body of a LaTeX document by the function treeToLaTeX2, which calls nodeToLatex in order to convert an individual node. Because of the recursive structure of the parse tree, it is likely that nodeToLatex will in turn call treeToLaTeX2 in order to deal with the sub-nodes. The signature of treeToLaTeX2 is defined like this:

treeToLaTeX2 :: [Anything Char]->MyState-> (MyState,String)

This means the first parameter is a list of nodes, thus the parse tree, and the second is a mutable state. State is necessary, among many other things, for giving unique numbers to the images. In Haskell there is no way to use state apart from threading it through the whole program as an additional parameter. An exception is Haskell's State monad, which is in my view just a fancy notation for threading the state through in the way mentioned above. The function returns a string, which is obviously the body of the desired LaTeX document. More precisely, the function returns a tuple consisting of the possibly modified state and the body of the LaTeX document. Thus the state is taken as a parameter, possibly modified, and returned as part of the return value. This is what I mean by threading state through.
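As a tiny stand-alone illustration of this threading pattern (it does not use the project's MyState; MyCounter and freshLabel are made-up names), consider a counter that hands out unique numbers, e.g. for images:

-- Hypothetical example of threading state: the state goes in as a parameter
-- and comes back, possibly modified, as part of the result tuple.
data MyCounter = MyCounter { nextNumber :: Int }

freshLabel :: MyCounter -> (MyCounter, String)
freshLabel st = (st { nextNumber = nextNumber st + 1 },
                 "image" ++ show (nextNumber st))

main :: IO ()
main = do
    let st0 = MyCounter 1
        (st1, a) = freshLabel st0
        (st2, b) = freshLabel st1
    print (a, b, nextNumber st2)   -- prints ("image1","image2",3)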

A function we have to modify particularly often is the templateProcessor. Its signature is defined like this:

templateProcessor :: MyState->(String,[(String,[Anything Char])])->(MyState,String)

Again we see the threading of state: it is passed as the first parameter and returned as the first entry of the returned tuple. The second parameter is a tuple. Its signature is obviously

(String,[(String,[Anything Char])])

The first entry is the name of the template. The second is a dictionary (associative array). It contains the names of the parameters of the template as keys and their associated parse trees as values. This dictionary is often called ll. You can check for the existence of a key with hasKey and retrieve a value with getValueForKey. You can also make sure that certain keys will exist with a default value using the function ensure. We will now look at an example:

templateProcessor st ("Druckversion Kapitel",ll)=(st, "\\chapter{"++tx++"}" )
  where 
    tx=(treeToLaTeX (getValueForKey "1" ll) st)

First we have the name of the template: it is "Druckversion Kapitel". Thus this definition of templateProcessor will only be used if we actually got a template with this name. The right-hand side of the equals sign is:

(st, "\\chapter{"++tx++"}" )

Thus it is a tuple. st is the unmodified version of the state we got as the first parameter. The second entry in the returned tuple is a string. It is the result of concatenating three strings; the string concatenation operator is ++. The string in the middle is called tx. It is defined on the line below the where keyword:

tx=(treeToLaTeX (getValueForKey "1" ll) st)

Here we get the value for the key "1" from the dictionary ll. This is the parse tree associated with the first parameter of the template. We just put it into treeToLaTeX in order to get back the LaTeX string representing it.
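As a further, purely hypothetical illustration (this clause is not in the source; the template name "ExampleDefinition" and its keys are made up), a template with two parameters could be handled like this:

templateProcessor st ("ExampleDefinition",ll)=(st, "\\item["++name++"] "++text )
  where 
    name=(treeToLaTeX (getValueForKey "1" ll) st)
    text=(treeToLaTeX (getValueForKey "2" ll) st)

Here both parameters are rendered to LaTeX strings with treeToLaTeX and then glued into an \item of a LaTeX description environment.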

PEG/Packrat

We also looked at the possibility of using a Packrat parser. We were not able to compile it with the full grammar and had to leave out two thirds of the HTML in the grammar in order to be able to compile the parser on a machine with 2 GB of RAM. For a 700 kB input file, on a machine roughly as fast as an Intel Atom 230, we found:

old parser (splitting into chunks of 10 kB before parsing)     3 minutes
old parser (not splitting)                                    14 minutes
packrat parser                                                 6 minutes

So we see that the linear time O(n) behaviour of the Packrat parser is an advantage for long source files. Still, a wiki source file is usually much shorter, so for now we will stick with the old parser.

Current parser

The linked page explains how the parser works.

Benutzer:Dirk_Huenniger/wb2pdf/parser