Benutzer:Dirk Huenniger/wb2pdf/manual

About MediaWiki2LaTeX

The MediaWiki2LaTeX software converts Wikimedia articles and other pages (Wikipedia Books, Wikibooks, etc.) to other softcopy formats for offline use. It can render a single page or a whole collection of linked pages as a single output file. Book sources supported include:

MediaWiki2LaTeX was conceived and created by Dirk Huenniger over a period of several years, when the Wikimedia Foundation's (WMF) own offline content generation (OCG) software had broken down and was not available.

Once it became stable and a web interface was added, the WMF made hosting space available for an online service at wmflabs. The software is open source and can also be downloaded and installed locally. A Debian/Ubuntu Linux package is available if you do not want to compile it yourself.

However, its functionality is not yet fully complete and it remains under active development. Programmers with Haskell skills are especially invited to contribute.

Technically, MediaWiki2LaTeX is written for the most part in the functional programming language Haskell. It can accept either raw wikitext or served HTML web pages as input and typically converts them to the LaTeX typesetting language as an intermediate format.

The documentation on this webpage was originally written by me, Benutzer:Dirk Huenniger. I am the main software developer of mediawiki2latex, so I probably know the technical details best. Other editors updated the manual or improved readability.

Recent changes

  • December 2019. Book Contents Page Template expansion option added.
  • November 2019. Online processing improved with faster rendering of larger books.
  • 7 November 2016. MediaWiki2LaTeX servers can now run conversion requests in parallel.

Additional User Documentation

There is an independent guide to mediawiki2latex on the Edutech Wiki of the University of Geneva. It may provide complementary information.

Processing Multiple Articles

You can simply create a new page on the wiki and type something like this:

{{:MyPageOne}}
{{:MyPageTwo}}

The resulting page will display a concatenation of the pages MyPageOne and MyPageTwo.

MediaWiki Collections

You may create a collection using MediaWiki's Collection Extension, which also has a "Book Creator" user front end. The resulting collection page can be processed from the web interface by selecting Template Expansion > Book / Collection, or from the command line using the option --bookmode. Keep in mind that the standard web interface has a time limit of 1 hour, which is enough for around 200 pages. This limit can be changed, and a 4 hour (around 800 pages) service is also available online. The command line version does not have any limit.
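
The figures above imply a rough rate of about 18 seconds per page (3600 seconds for 200 pages). As a back-of-the-envelope sketch, assuming that this rate stays roughly constant (real run times depend on the number of images and tables), you could estimate whether a collection fits a given time limit. The function names are hypothetical and not part of mediawiki2latex.

```python
# Rough estimate only, derived from the manual's figures of about
# 200 pages within the 1 hour limit (and 800 pages within 4 hours).
SECONDS_PER_PAGE = 3600 / 200  # 18.0 seconds, assumed to be constant


def estimated_seconds(pages: int) -> float:
    """Estimate the conversion time for a collection of `pages` pages."""
    return pages * SECONDS_PER_PAGE


def fits_time_limit(pages: int, limit_hours: float) -> bool:
    """Return True if the estimate stays within the given time limit."""
    return estimated_seconds(pages) <= limit_hours * 3600


if __name__ == "__main__":
    for pages in (150, 500, 1000):
        print(pages, fits_time_limit(pages, 1), fits_time_limit(pages, 4))
```

If the estimate exceeds both online limits, the command line version is the option left, since it has no time limit.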

Web Interface

Process Limits

Online services

There are two services, each with a different time limit and parallel process capability configured. Consequently each can process a different maximum size of collection or book:

If you submit a request while the maximum number of requests is already running, you will see an error message saying "Not enough resources available to process your request! Your request has been dropped!" In this case you can try again later, try the other server, or, if you will be a frequent user and have some technical knowledge, install the software locally.

If either service accepts your request but times out while processing, it will fail with a timeout error.

Local installation

If you install mediawiki2latex locally there is no time limit. mediawiki2latex is open source software and thus free to use and even to modify.

URL to the Wiki to be converted

This is the full URL or web address of the Wikipedia article you wish to convert. You can open the page you want in your web browser and copy the contents of the address bar at the top of your browser. That is already the correct URL, which you can paste here.

If you're linking to the tool, you can also preload the URL like so:

https://mediawiki2latex.wmflabs.org/fill/encoded-url

For example, to preload Natur: Zukunft the link would be:

https://mediawiki2latex.wmflabs.org/fill/https%3A%2F%2Fde.wikibooks.org%2Fwiki%2FNatur%3A_Zukunft

When writing templates, you can use the following pattern:

https://mediawiki2latex.wmflabs.org/fill/{{urlencode:{{fullurl:{{FULLPAGENAME}}}}}}

When writing JavaScript, you can use the following pattern instead:

'https://mediawiki2latex.wmflabs.org/fill/' + encodeURIComponent( 'https://de.wikibooks.org/wiki/' + mw.config.get( 'wgPageName' ) )
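
The same preload link can be built outside the wiki. The following is a minimal Python sketch of the pattern above; the function name build_fill_url is made up for this example, and only the /fill/ prefix and the percent-encoding come from this manual.

```python
from urllib.parse import quote

FILL_PREFIX = "https://mediawiki2latex.wmflabs.org/fill/"


def build_fill_url(page_url: str) -> str:
    """Return a link that preloads `page_url` in the web interface."""
    # safe="" makes quote() percent-encode ':' and '/' as well, which
    # gives the same result as encodeURIComponent for URLs like this one.
    return FILL_PREFIX + quote(page_url, safe="")


if __name__ == "__main__":
    print(build_fill_url("https://de.wikibooks.org/wiki/Natur:_Zukunft"))
```

For the Natur: Zukunft page this reproduces the example link shown above.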

Output Formats

You can choose between the following output formats:

  • PDF: A PDF file of the article you selected by supplying its URL. The PDF file is created using the LaTeX typesetting software, which is often used to write books and articles in mathematics, physics and related fields. MediaWiki2LaTeX first converts the content to LaTeX source, which LaTeX then compiles to a PDF file.
  • LaTeX zip: A ZIP file containing the LaTeX intermediate code for the article. This is useful if you want to change the layout using the LaTeX software yourself. In this case you will need to install the Ubuntu app on Windows or have a Debian- or Ubuntu-like operating system installed. In order to compile the source with LaTeX you will also have to install the mediawiki2latex package from your distro's repository.
  • EPUB: A file in EPUB2 format suitable for use with e-book readers or compatible browser extensions. EPUB is suitable for reading offline, editing content and converting to other formats using programs like Calibre or Pandoc. An EPUB2 file is a ZIP archive consisting mainly of XHTML documents, SVG graphics and raster images. The content corresponds closely to that of the wikibook itself, with the (semantic) markup used by the original authors. EPUBs are always generated from the HTML content, so the Standard, Book / Collection or Book Contents Page Template Expansion options (see below) are recommended.
  • ODT (Word Processor): An Open Document Text file. Useful for importing into your favorite word processing software if you want to modify the article offline.

Template Expansion

  • Standard: The default mode recommended especially for Wikipedia articles. The HTML web page generated by MediaWiki is processed and, in most cases, renders a single page or article. However for a Wikipedia book in the Book namespace, the Standard option renders an entire book.
  • Book / Collection: The HTML web page generated by MediaWiki is processed. All links on the first wiki page will also be followed and the linked pages processed, but not recursively after that. This allows Wikipedia books in User: space to be rendered. For books at Wikibooks whose index / table of contents links to all chapters of the book, or for a collection page, this option automatically generates the complete book. Book authors or users may create specific index pages or collection pages for this purpose, to define the articles in the book.
  • Book Contents Page: For Wikibooks and other projects. This is similar to the previous option, but only links within the book namespace are included. This is useful where the contents/index page includes links to book catalogues, categories, related projects or authors which are not directly related to the current book. Only book content linked from such an index page is included.
  • Expand templates by MediaWiki: The Wikitext source for the pages is processed. Templates are expanded by MediaWiki into Wikitext. The Wikitext is then parsed and processed further. Use this mode if you don't get the result you intended with the standard mode.
  • Expand templates internally: The Wikitext source for the pages is processed. Templates are not expanded automatically but are instead mapped to LaTeX commands using a default mapping file. If a template is not defined in the mapping file, an "unknown template" error message will be written into the output text. This mode can be useful if you intend to compile a wikibook on the English or German Wikibooks. If you know LaTeX and want to create a PDF file that looks exactly the way you want, you can also provide your own mapping file using the -t command line option.

The paper size is the size of the page you wish to use. Sizes available are:

Size            mm           Inches
A4         210.0 × 297.0   8.27 × 11.69
A5         148.0 × 210.0   5.83 × 8.27
B5         176.0 × 250.0   6.93 × 9.84
Letter     215.9 × 279.4   8.50 × 11.00
Legal      215.9 × 355.6   8.50 × 14.00
Executive  184.2 × 266.7   7.25 × 10.50

Note that for the current EPUB2 output the document rendering is sized by the viewport of the user's viewer, so the paper size is ignored.
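
The inch column of the table above follows from the millimetre column at 25.4 mm per inch. As a small sketch (the dictionary and function names are made up for this example), the conversion can be reproduced like this:

```python
from typing import Tuple

MM_PER_INCH = 25.4

# Paper sizes in millimetres, copied from the table above.
PAPER_SIZES_MM = {
    "A4": (210.0, 297.0),
    "A5": (148.0, 210.0),
    "B5": (176.0, 250.0),
    "Letter": (215.9, 279.4),
    "Legal": (215.9, 355.6),
    "Executive": (184.2, 266.7),
}


def size_in_inches(paper: str) -> Tuple[float, float]:
    """Return (width, height) in inches, rounded to two decimal places."""
    width_mm, height_mm = PAPER_SIZES_MM[paper]
    return (round(width_mm / MM_PER_INCH, 2), round(height_mm / MM_PER_INCH, 2))


if __name__ == "__main__":
    for name in PAPER_SIZES_MM:
        print(name, size_in_inches(name))
```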

Vector Graphics

Images and other content may sometimes be included in scalable vector graphics format (SVG), allowing lossless arbitrary scaling. Some PDF generation tools do not support the conversion of SVG to EPS (Encapsulated PostScript) or PDF vector formats, so the "Rasterize" option, which is the default, converts SVG files to raster (bitmap) graphics format.

You can override this using the "Keep Vector Form" option. EPUB is intended for reflowable and scalable presentation, and EPUB viewers support SVG. Often the file size of a vector graphic is much smaller than that of the equivalent raster image, so it is typically useful to select "Keep Vector Form" for EPUBs.

Note that the current implementation generates a new image within the book each time an image is linked. If the same logo, pictogram or other image appears many times within a book, this may result in a huge file size and long rendering times for such a book. This applies to both the raster and vector options. Therefore, authors may wish to avoid repeated usage of the same image or graphic in any given book.

Command Line Interface

Overview of parameters:

  -V, -?, -v    --version, --help     show version number
  -o FILE       --output=FILE         output FILE (REQUIRED)
  -f START:END  --featured=START:END  run selftest on featured article numbers from START to END
  -x CONFIG     --hex=CONFIG          hex encoded full configuration for run
  -s PORT       --server=PORT         run in server mode listen on the given port
  -t FILE       --templates=FILE      user template map FILE
  -r INTEGER    --resolution=INTEGER  maximum image resolution in dpi INTEGER
  -u URL        --url=URL             input URL (REQUIRED)
  -p PAPER      --paper=PAPER         paper size, one of A4,A5,B5,letter,legal,executive
  -m            --mediawiki           use MediaWiki to expand templates
  -h            --html                use MediaWiki generated html as input (default)
  -e            --tableslatex         use LaTeX to generate tables
  -n            --noparent            only include urls which are children of start url
  -k            --bookmode            use book-namespace mode for expansion
  -z            --zip                 output zip archive of latex source
  -b            --epub                output epub file
  -d            --odt                 output odt file
  -g            --vector              keep vector graphics in vector form
  -i            --internal            use internal template definitions
  -l DIRECTORY  --headers=DIRECTORY   use user supplied latex headers
  -c DIRECTORY  --copy=DIRECTORY      copy LaTeX tree to DIRECTORY

--version

Shows version and help information

--output=FILE

Set the output file to which the result will be written. On Windows you must ensure that the file is not currently open in any other software, since it won't be writable in that case. The file extension is not evaluated, so you will have to set other parameters to define the output format.

--featured

This option is not implemented and the parameter might go away.

--hex

This parameter takes the whole configuration of mediawiki2latex as a single hex encoded string. This is only used by the mediawiki2latex server when it calls its sub processes. This is necessary to avoid shell injection attacks, as the shell will just see a hex encoded string and not try to run any script from that.
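
To illustrate why a hex string is harmless for a shell, here is a minimal sketch. It does not reproduce the actual configuration format used by mediawiki2latex, which is not described in this manual; the functions and the sample input are purely hypothetical.

```python
def to_hex(config: str) -> str:
    """Encode a configuration string as lowercase hexadecimal digits."""
    return config.encode("utf-8").hex()


def from_hex(encoded: str) -> str:
    """Decode a string produced by to_hex back into the original text."""
    return bytes.fromhex(encoded).decode("utf-8")


if __name__ == "__main__":
    # Hypothetical hostile input containing shell metacharacters.
    risky = "output.pdf; rm -rf ~"
    encoded = to_hex(risky)
    # The encoded form contains only the characters 0-9 and a-f,
    # so there is nothing in it that a shell could interpret.
    print(encoded)
    assert from_hex(encoded) == risky
```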

--server=PORT

Run the mediawiki2latex web interface as an HTTP server. Listen on PORT.

--templates=FILE

Define a custom mapping file of MediaWiki templates to LaTeX commands. An example is given in the file templates.user. The original wikitext will be parsed by mediawiki2latex. MediaWiki will not be used to expand any templates. An "Unknown Template" error message will be added to the output PDF file wherever a template is encountered that is not listed in the mapping file.

--resolution=INTEGER

By default all images with a resolution higher than 300 dpi will be scaled down to 300 dpi in order to reduce the size of the resulting PDF file. With this parameter you can override this limit with your intended resolution. This is helpful if you need to produce a PDF file that is small enough to be uploaded to a file hosting website.
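
The effect of such a limit can be sketched with simple arithmetic: the effective resolution of an image is its pixel width divided by its printed width in inches. The code below is only an illustration under that assumption; it is not the algorithm mediawiki2latex actually uses, and the function names are made up.

```python
def effective_dpi(pixel_width: int, printed_width_inches: float) -> float:
    """Resolution an image has when printed at the given width."""
    return pixel_width / printed_width_inches


def target_pixel_width(pixel_width: int, printed_width_inches: float,
                       max_dpi: int = 300) -> int:
    """Pixel width after capping the resolution at `max_dpi`."""
    if effective_dpi(pixel_width, printed_width_inches) <= max_dpi:
        return pixel_width  # already within the limit, keep as is
    return round(max_dpi * printed_width_inches)


if __name__ == "__main__":
    # A 3000 px wide image printed 6 inches wide has 500 dpi.
    print(effective_dpi(3000, 6.0))
    print(target_pixel_width(3000, 6.0))
    print(target_pixel_width(3000, 6.0, max_dpi=150))
```

Halving the resolution limit reduces the pixel count of an affected image to roughly a quarter, which is why a lower value shrinks the output file noticeably.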

--url=URL

The URL for the main page you wish to convert

--paper=PAPER

The size of the page you wish to use in the PDF. Supported values are the European DIN formats A4, A5 and B5 as well as the American formats letter, legal and executive. LaTeX allows you to define further paper sizes if you need them.

--mediawiki

Use MediaWiki to expand the MediaWiki templates in the wikitext source, then parse and process the resulting expanded wikitext source with mediawiki2latex.

--html

Use MediaWiki to generate an HTML page from the wikitext source, then parse and process the resulting HTML with mediawiki2latex.

--tableslatex

Use LaTeX to generate tables. The tables will be rendered using LaTeX instead of Chromium. This may give better, more book-like results, especially for simple tables, but may distort tables that use HTML in a complex way.

--noparent

Only include URLs which are children of the start URL. This is similar to bookmode (see below), but only children of the supplied URL are processed.

--bookmode

This mode is for processing collections made with the MediaWiki Collection extension. This includes the pages found in the Book namespace on Wikipedia as well as user defined collections in the User namespace. mediawiki2latex will follow all links in the wikitext, but not recursively. For each link it will load the HTML. It will stitch together all HTML loaded, then parse and process that. This option can be combined with --mediawiki or --internal or --templates causing the download of wikicode instead of HTML.

--zip

Create a zip file of the LaTeX intermediate code generated.

--epub

Create an ePub file of the article as output. Essentially an intermediate HTML file will be created. Images will be processed as usual and mathematical formulas will be rendered as images. This intermediate result will be converted to an ePub file by calibre.

--odt

Create an ODT file of the article as output. ODT stands for Open Document Text and can be imported by common word processing software. It is native to OpenOffice and LibreOffice. The same approach with an intermediate HTML file as described above for ePub is used, but the ODT file is created by LibreOffice.

--vector

Include the source vector file in the PDF output instead of converting it to raster (bitmap) format, which is the default. PDF processing and viewing software usually does not work well with vector graphics, so it's not recommended to do so.

--internal

Same as --templates, but uses a default template definition file compiled into the mediawiki2latex executable. This might be useful on German and English wikibooks, since the template definition file contains some reasonable definitions for many templates on these sites.

--headers=DIRECTORY

Copy a directory with custom header files into the temporary LaTeX document tree before running XeLaTeX. This way you can define custom layouts and your own LaTeX \newcommand definitions, which makes sense in combination with the --templates option described above.

--copy=DIRECTORY

Copy the LaTeX (and possibly HTML) intermediate files to the given directory. This option is useful if you want to manually edit the LaTeX document and compile it yourself. mediawiki2latex will still do everything requested, including compiling the sources and creating the output files; it copies the directory immediately before the compile step.
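
The options described above can be combined in a script. The following Python sketch assembles a command line from the documented long options only; it assumes that the executable is installed as mediawiki2latex on your PATH, and the helper function build_command is made up for this example.

```python
import shlex
from typing import List


def build_command(url: str, output: str, paper: str = "A4",
                  book_mode: bool = False, keep_vector: bool = False) -> List[str]:
    """Assemble a mediawiki2latex call from options listed in this manual."""
    cmd = ["mediawiki2latex", "--url=" + url, "--output=" + output,
           "--paper=" + paper]
    if book_mode:
        cmd.append("--bookmode")  # process a collection page
    if keep_vector:
        cmd.append("--vector")    # keep vector graphics in vector form
    return cmd


if __name__ == "__main__":
    cmd = build_command("https://en.wikipedia.org/wiki/LaTeX", "latex.pdf")
    # Print the command for inspection. To actually run it you could use
    # subprocess.run(cmd, check=True) from the standard library.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than as one shell string avoids quoting problems with URLs that contain special characters.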

Wiki Source Page Code

MediaWiki2LaTeX is sensitive to some features in the Wikimedia page source. In some cases you can improve the rendering by following the tips given here. When you change a page for this purpose, it is recommended that you add an HTML comment to the source code, something like this:

<!-- This parameter added to improve print rendering. Please do not remove. -->

There are some rules for the typesetting of tables:

  • Tables can include horizontal and vertical lines and a frame surrounding the table. These will be drawn if and only if the template prettytable or the attribute class="wikitable" is present in the header of the table.
  • It can be useful to reduce the font size for a whole table. This can be achieved by writing latexfontsize="scriptsize" into the header of the table.
  • In contrast to the tolerant behavior of MediaWiki, wikipdf requires a new table to start on a new line.
  • You can define the width of columns in a table using the width attribute with a value in percent (%) in the attributes of cells of the table.
  • Table headings are supported. In a large table spanning several pages, it is often required to repeat the header (that is, some rows at the beginning of the table) at the beginning of each page. This is done by marking some cells as header cells using the exclamation mark (!) instead of the vertical bar (|) in the wiki syntax. The program considers the first few rows to be part of the header as long as they continuously contain header cells.

List of Figures

A table of images, their authors and licenses is automatically created in the appendix. In order to determine the name of the author, the information template on the description page of the image is analyzed, thus it needs to be present and to have a valid author entry.

Performance Considerations

In the mediawiki2latex project we try to make everything as correct as possible and to use the highest possible quality setting for every step. In particular, when we need to decide whether to sacrifice runtime or quality, we always decide to sacrifice runtime. Due to continuous demand for a faster way to run mediawiki2latex we came up with the tachyon subproject. Here we try to get the fastest possible results, sacrificing a lot of quality and correctness. The contributor information needed for open source licenses is left out entirely, and images are copied directly from links in the HTML generated by the MediaWiki servers. Everything is done in C with lots of security critical bugs, and only the most relevant use cases are taken care of.

We got a result here which took 90 seconds to create. The contributor information was added later on. The normal mediawiki2latex run took 21 minutes and 45 seconds for the same case. That is more than a factor of 14 slower. See on the right for the result:

If you want to give it a try, install the latest version of mediawiki2latex on Ubuntu following the installation instructions given above, check out the source from git and run:

sudo apt-get install
g++ -I/usr/include/tidy tachyon.cpp  -ltidy
./a.out https://de.wikibooks.org/wiki/Physikalische_Grundlagen_der_Nuklearmedizin:_Druckversion
evince myfile.pdf

Wikipedia Books

In order to meet a demand for shorter compilation times for books from the Wikipedia "Book:" namespace, the idea of keeping precompiled books ready for download came up. We started to compile all community maintained books on the English Wikipedia (a bit more than 6000) on an old 22 nm dual-core laptop in November 2019. By the end of December we had compiled more than 20% of them, fixing many small, seldom occurring bugs. From the statistics acquired by this project we calculated that we would have compiled all of them by July 2020 if we continued as before. Furthermore, we calculated that a full rebuild of all those books would take less than four weeks and incur cloud fees of 320 EUR using a 24-core AMD EPYC with 384 GByte of RAM. We expect that the book creation process will fail in significantly less than 5% of cases. The web space needed for the cache will be about 600 GByte. So in terms of technical feasibility and cost, setting up such a cache is now straightforward. The only remaining issues are administrative ones: which kind of web space will be considered secure enough for linking to from Wikipedia?

Server Configuration

Size of Files and Image Resolution

Often the application in which you want to use the generated PDF allows only a maximum file size. This often applies to print-on-demand services. You can reduce the output file size by setting the images to a lower resolution, losing some quality. Typical printing machines used for manufacturing books on an industrial scale today use a resolution of 300 dpi. Thus a higher resolution is usually not necessary. You can enter the maximum allowed resolution in the Graphical User Interface (this feature may not be implemented yet). All images with higher resolutions will be reduced accordingly.

Width of Images

The width of an image will usually be as large as possible, determined by the width of the page as well as the margins. You may modify this behavior by using a px command when including the image in the wiki source text. 400 pixels correspond to the maximum available width. Thus writing 200px will reduce the image to half of the maximum available width.
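
The rule above amounts to a simple ratio. As a sketch (the function name is made up, and treating values above 400 px as full width is an assumption that this manual does not confirm):

```python
MAX_WIDTH_PX = 400  # per the text above: 400 px is the maximum available width


def width_fraction(px: int) -> float:
    """Fraction of the available width used by an image given as `px`."""
    # Assumption: values above 400 px are capped at the full width.
    return min(px, MAX_WIDTH_PX) / MAX_WIDTH_PX


if __name__ == "__main__":
    for px in (100, 200, 400):
        print(px, width_fraction(px))
```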

Wrapping Images

The former template [[Vorlage:Latex Wrapfigure|Latex Wrapfigure]] can be used to let text wrap around an image. It takes two parameters, image and width. Width is between 0.0 and 1.0, where 1.0 means the full width of the text, 0.5 means half the width of the text and so on. Image has to be a link to an image in wiki notation, with leading and trailing double square brackets. See also the section on user-defined templates in this document and the manual of the wrapfigure LaTeX package found on CTAN.

Templates

Automated Expansion

In the default case all Templates are expanded by MediaWiki. This is the meaning of the setting Template Expansion = MediaWiki in the GUI.

Manual Expansion for PDF and LaTeX Output

It is hard for an algorithm to determine how a MediaWiki template should be converted to LaTeX code. This is because templates make extensive use of HTML in order to produce good-looking output in a web browser, which is very different from the "what you get is what you mean" style of LaTeX. Still, all templates are expanded algorithmically by default, as explained above. However, we recommend another way of dealing with templates, which we will explain now. You have to set Template Inclusion=normal in the GUI. In this case only a limited number of templates is taken into account by wb2pdf. All other templates will cause the message UNKNOWN TEMPLATE to appear in the resulting file. It is advisable to search the output files for this string in order to make sure that all templates were processed correctly. To extend the template processor with custom templates you have to modify the file templates.user in the directory wb2pdf/trunk/latex.

[
["mywikitemplate1","MyLaTeXTemplate","paramx","3","paramy"],
["print version cover","LaTeXNullTemplate"],
["GCC_take_home","LaTeXGCCTakeTemplate","1"]
]

It contains a list of sublists. The first item in each sublist is the name of the template in the wiki. The second is the name of the template in LaTeX. The following n elements of the sublist are the parameters in the wiki which shall be passed to the template in LaTeX. You also have to modify templates.tex in the directory wb2pdf/trunk/document/main to add a definition for the LaTeX version of the template. When modifying templates.user, be aware that each entry ends with a comma except for the last entry, which does not end with a comma. Furthermore, umlauts and other non-ASCII characters have to be encoded in decimal UTF-8 notation, for example:

"\195\156berschriftensimulation 5"


This isn't such a big problem, since the Unknown Template error message in the main.tex file in the directory wb2pdf/trunk/document/main will have exactly this format (decimal UTF-8 notation), so you just need to copy and paste it.
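
If you prefer to generate the notation yourself, the encoding can be sketched in a few lines of Python. This is an illustration and not part of mediawiki2latex; the function name is made up, and it assumes that every byte above 127 of the UTF-8 encoding is written as a backslash followed by its decimal value, as in the example above.

```python
def to_decimal_utf8(text: str) -> str:
    """Write non-ASCII characters as backslash plus decimal UTF-8 bytes."""
    parts = []
    for byte in text.encode("utf-8"):
        if byte < 128:
            parts.append(chr(byte))         # plain ASCII stays unchanged
        else:
            parts.append("\\" + str(byte))  # e.g. byte 195 becomes \195
    return "".join(parts)


if __name__ == "__main__":
    # U+00DC is the capital U umlaut; its UTF-8 bytes are 195 and 156.
    print(to_decimal_utf8("\u00dcberschriftensimulation 5"))
```

Note that this sketch does not handle a digit that directly follows a non-ASCII character, which could make the escape ambiguous; check such names by hand.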

If you need more freedom in defining how a template is processed, you can also edit the source code of the template processor. To extend the template processor of mediawiki2latex with your custom templates, modify the function templateProcessor in the file LatexRenderer.hs and recompile. To do so you need to install the Glasgow Haskell Compiler as well as its package manager (cabal). Many examples of custom templates are given in LatexRenderer.hs. This file is written in the purely functional programming language Haskell, so some knowledge of Haskell will help you to define the processing of your custom template. LatexRenderer.hs is essentially a code generator that writes code in the LaTeX typesetting language, which you will also need to learn in order to extend the custom template abilities of wb2pdf.

Manual Expansion for EPUB ODT and HTML Output

You need to modify the function templateProcessor in the file HtmlRenderer.hs and recompile. After that you need to run mediawiki2latex with the -i command line option. More hints can be found on the discussion page; just click Diskussion ("Discussion") at the top of this page.