When I was writing code to add EPUB publishing to Mark Book Builder, I found there wasn’t a lot of information online about the EPUB file format. In this article I’m sharing what I learned in the hope that it helps others.
Tools
Two tools helped me learn about the EPUB file format. Sigil is an EPUB book editor. Being able to open EPUB books and see their contents taught me a lot about the EPUB file format.
IDPF, the group that creates the EPUB specification, has a validator tool. Click the Choose File button to upload an EPUB book. Click the Validate button to see if your EPUB book has valid EPUB.
EPUB Overview
An EPUB book is basically a zip archive of a website. Each chapter of your book is a web page. Like a website, an EPUB book can include CSS files to style the book, fonts, image files, audio files, and video files.
The Root of the Archive
The root of an EPUB archive has three items.
- mimetype
- META-INF folder
- OEBPS folder
The mimetype File
The mimetype file identifies the book as being an EPUB book. It is a very short file.
application/epub+zip
The file must contain only this text. Make sure you don’t press the Return key to create a new line at the end.
The mimetype file must be the first item in the EPUB archive, and it must be stored uncompressed.
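The two mimetype rules are easy to break when you zip a folder by hand, so here is a minimal Python sketch of one way to build the archive with the standard zipfile module. The function name write_epub and the files dictionary are my own names for illustration, not part of any EPUB tool.

```python
import zipfile


def write_epub(target, files):
    """Write an EPUB archive to `target` (a path or a writable binary file object).

    `files` maps archive paths such as "META-INF/container.xml" to their bytes.
    """
    with zipfile.ZipFile(target, "w") as archive:
        # The mimetype entry goes in first, stored without compression,
        # and its content has no trailing newline.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Every other file can be compressed.
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype entry before looping over the other files is what keeps it first in the archive.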
META-INF Folder
The META-INF folder must have at least one file in it: the container file. The container file has the filename container.xml. The container file specifies where the book’s content (the OPF file) resides in the book’s EPUB archive. The following code shows a standard container file:
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0"
xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="OEBPS/content.opf"
media-type="application/oebps-package+xml" />
</rootfiles>
</container>
If you name the OPF file content.opf and place it inside the OEBPS folder, you should be able to copy and paste the code into your own container file.
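If you are reading EPUB books instead of writing them, the container file is where you start. The following Python sketch shows one way to pull the OPF location out of a container file using the standard library. The function name find_opf_path is a hypothetical name I chose for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the elements in container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml):
    """Return the full-path of the first rootfile in container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.get("full-path")
```

For the container file shown above, the function returns OEBPS/content.opf.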
OEBPS Folder
Most of your book’s content resides in the OEBPS folder. Your book’s chapters reside in the OEBPS folder along with any additional files, such as image, audio, and video files.
In addition to text, image, audio, and video files, the OEBPS folder contains the following items:
- OPF file
- NAV file
- NCX file
OPF File
The OPF file, named content.opf, is an XML file that lists the content in the book. The start of the file specifies the XML version and the package version, which is the EPUB version. The following XML code shows the start of an OPF file:
<?xml version="1.0" encoding="utf-8"?>
<package version="3.0" unique-identifier="pub-identifier"
xmlns="http://www.idpf.org/2007/opf">
The version="3.0" part specifies that the book is an EPUB 3 book. The unique-identifier value must match the id of the identifier entry in the metadata section.
There are three sections you must include in the OPF file.
- Metadata
- Manifest
- Spine
Metadata
The metadata section contains information about the book. The metadata starts with a <metadata> tag.
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
An EPUB 3 book requires the following metadata entries:
- Title
- Language
- Identifier
- Modified Date
The title entry specifies the title of the book.
<dc:title id="pub-title">Simple Book: A Beginning</dc:title>
The language specifies the language used to write the book. The following code shows an entry for a book written in United States English:
<dc:language>en-US</dc:language>
The identifier is a unique identifier for the book, such as an ISBN.
<dc:identifier id="pub-identifier">urn:uid:1250064712</dc:identifier>
The modified date specifies the date and time the book was last modified.
<meta property="dcterms:modified">2019-05-22T12:00:00Z</meta>
You must use the format string CCYY-MM-DDThh:mm:ssZ to format the date and time. As you can see in the code example, you need the letter T between the date and time and the letter Z after the time. The Z indicates that the time is in UTC. The EPUB standard is very picky about the modified date. You can’t just enter the date. You have to include the date and time in the right format.
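Because the format is so strict, it helps to generate the modified date in code instead of typing it. Here is a small Python sketch that produces a string in the CCYY-MM-DDThh:mm:ssZ format. The function name modified_timestamp is my own for this example.

```python
from datetime import datetime, timezone


def modified_timestamp(moment=None):
    """Format a datetime as CCYY-MM-DDThh:mm:ssZ in UTC.

    With no argument, the current time is used. A datetime without
    time zone information is treated as local time by astimezone().
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

The conversion to UTC happens before formatting, so the Z at the end of the string is accurate even if you pass in a time from another time zone.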
Common optional metadata entries include the book’s author, publisher, and copyright. The following code shows an example of a metadata entry for a book’s author:
<dc:creator>Mark Szymczyk</dc:creator>
Add the closing tag to end the metadata section.
</metadata>
Manifest
The manifest contains a list of every file in the EPUB book. The following code contains a short example of a manifest:
<manifest>
<item id="nav" href="Text/nav.xhtml" media-type="application/xhtml+xml"
properties="nav"/>
<item id="Chapter1" href="Text/Chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="Chapter2" href="Text/Chapter2.xhtml" media-type="application/xhtml+xml"/>
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
</manifest>
There are four items in the example: the NAV file, two chapters, and the NCX file. Each manifest item has the following attributes:
- id, which identifies the manifest item.
- href, which specifies where the item resides in the OEBPS folder, relative to the OPF file.
- media-type, which specifies the type of file. Text files usually have the type "application/xhtml+xml".
The NAV file has an additional attribute, properties="nav", that specifies it is used to navigate the book as a table of contents.
A real book is going to have a much longer manifest. There will be a manifest entry for each chapter in the book as well as an entry for each image used in the book.
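Writing a long manifest by hand gets tedious, so you may want to generate the entries. The following Python sketch builds one manifest entry from a file's location. The function name manifest_item and the MEDIA_TYPES table are assumptions I made for the example, the table covers only a few common file types, and using the filename without its extension as the id is a shortcut that works only if those names are unique and valid XML ids.

```python
from pathlib import PurePosixPath
from xml.sax.saxutils import quoteattr

# A small, incomplete table of file extensions and their media types.
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".ncx": "application/x-dtbncx+xml",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".png": "image/png",
}


def manifest_item(href, properties=None):
    """Build an <item> entry for a file path relative to the OPF file.

    Raises KeyError if the file extension is not in MEDIA_TYPES.
    """
    path = PurePosixPath(href)
    media_type = MEDIA_TYPES[path.suffix.lower()]
    # Assumption: the filename without its extension is a usable id.
    attrs = (f"id={quoteattr(path.stem)} href={quoteattr(href)} "
             f"media-type={quoteattr(media_type)}")
    if properties:
        attrs += f" properties={quoteattr(properties)}"
    return f"<item {attrs}/>"
```

Calling manifest_item("Text/nav.xhtml", properties="nav") produces the entry for the NAV file, including the extra properties attribute.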
Spine
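The spine contains a list of the book’s content files, such as chapters, in linear reading order. Supporting files, such as images and CSS files, appear in the manifest but not in the spine.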
<spine toc="ncx">
<itemref idref="Chapter1"/>
<itemref idref="Chapter2"/>
</spine>
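There is an <itemref> tag for each file in the spine. The idref value of each tag matches the id of an item in the manifest. The toc value in the <spine> tag is the manifest id of the NCX file. After the spine, add the closing </package> tag to end the OPF file.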
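Each idref in the spine has to match the id of an item in the manifest, and a typo there is an easy mistake to make. Here is a Python sketch that reports spine entries with no matching manifest item. The function name unresolved_spine_refs is hypothetical, and the check covers only this one rule, not everything a validator looks for.

```python
import xml.etree.ElementTree as ET

# Namespace used by the elements in the OPF file.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def unresolved_spine_refs(opf_xml):
    """Return the spine idref values that match no manifest item id."""
    package = ET.fromstring(opf_xml)
    manifest_ids = {
        item.get("id")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    spine_refs = [
        ref.get("idref")
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
    return [idref for idref in spine_refs if idref not in manifest_ids]
```

An empty list means every spine entry points at a manifest item.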
Navigation
Navigation is an important feature in an ebook. People want to jump to specific chapters and sections in a book. As a reader it would be annoying to have to navigate page by page.
EPUB has two files for book navigation in e-readers: the NAV file and the NCX file.
NAV File
In EPUB 3 you use the NAV file, named nav.xhtml, to declare the book’s table of contents. The start of the file contains boilerplate code identifying the file as an XHTML document.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops"
xml:lang="en" lang="en">
The header follows. It usually contains the title of the book.
<head>
<title>Simple Book: A Beginning</title>
</head>
The table of contents is an ordered HTML list. Each list item is an HTML link whose destination is the location of the item inside the EPUB archive, written relative to the NAV file.
<body>
<nav epub:type="toc" id="toc">
<h1>Table of Contents</h1>
<ol>
<li><a href="../Text/Chapter1.xhtml">Chapter 1</a></li>
<li><a href="../Text/Chapter2.xhtml">Chapter 2</a></li>
</ol>
</nav>
</body>
</html>
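If your book has many chapters, you can generate the ordered list instead of typing it. The following Python sketch builds the list from pairs of link destinations and chapter titles. The function name nav_list is my own, and the sketch handles only a flat list with no sub-sections.

```python
from html import escape


def nav_list(chapters):
    """Build the <ol> for a NAV file from (href, title) pairs.

    The hrefs are written as given, so they should already be
    relative to the NAV file.
    """
    items = "\n".join(
        # Escape characters such as & and < so the output stays valid XHTML.
        f'<li><a href="{escape(href)}">{escape(title)}</a></li>'
        for href, title in chapters
    )
    return f"<ol>\n{items}\n</ol>"
```

Escaping the titles matters because a chapter title with an ampersand in it would otherwise make the NAV file invalid XHTML.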
NCX File
The NCX file, named toc.ncx, also contains the book’s table of contents. The NCX file provides compatibility with older EPUB versions.
The start of the file contains boilerplate code identifying the file as an NCX document.
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
"http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">
The header follows.
<head>
<meta content="urn:uid:1250064712" name="dtb:uid"/>
<meta content="0" name="dtb:depth"/>
<meta content="0" name="dtb:totalPageCount"/>
<meta content="0" name="dtb:maxPageNumber"/>
</head>
The first meta entry is the identifier. The identifier must match the identifier you gave the book in the metadata in the OPF file. The second meta entry lets you specify how many levels and sub-levels appear in the table of contents menu. You shouldn’t have to change the last two meta entries.
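Since the identifier lives in two files, it is easy for the two copies to drift apart. Here is a Python sketch that compares the dtb:uid entry in the NCX file with the identifier in the OPF file. The function names are hypothetical, and the sketch assumes the OPF identifier entry is the one whose id matches the unique-identifier value in the package tag.

```python
import xml.etree.ElementTree as ET

# Namespaces used in the OPF and NCX files.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
}


def opf_unique_identifier(opf_xml):
    """Return the text of the identifier named by unique-identifier."""
    package = ET.fromstring(opf_xml)
    wanted_id = package.get("unique-identifier")
    for element in package.findall("opf:metadata/dc:identifier", NS):
        if element.get("id") == wanted_id:
            return (element.text or "").strip()
    return None


def ncx_uid(ncx_xml):
    """Return the content of the dtb:uid meta entry in the NCX header."""
    ncx = ET.fromstring(ncx_xml)
    for meta in ncx.findall("ncx:head/ncx:meta", NS):
        if meta.get("name") == "dtb:uid":
            return (meta.get("content") or "").strip()
    return None


def identifiers_match(opf_xml, ncx_xml):
    """True when the NCX uid exists and equals the OPF identifier."""
    uid = ncx_uid(ncx_xml)
    return uid is not None and uid == opf_unique_identifier(opf_xml)
```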
The title of the book follows the header.
<docTitle>
<text>Simple Book: A Beginning</text>
</docTitle>
The table of contents appears as a navigation map. Each item in the navigation map has a navigation point. The navigation point contains an ID and its order in the book. Each navigation point includes a navigation label and the location of the item in the book.
<navMap>
<navPoint id="nav_1" playOrder="1">
<navLabel>
<text>Chapter 1</text>
</navLabel>
<content src="Text/Chapter1.xhtml"/>
</navPoint>
<navPoint id="nav_2" playOrder="2">
<navLabel>
<text>Chapter 2</text>
</navLabel>
<content src="Text/Chapter2.xhtml"/>
</navPoint>
</navMap>
</ncx>
Text Folder
Most EPUB books place their chapters inside a Text folder inside the OEBPS folder. It’s not mandatory to have a Text folder, but having your chapters in a separate folder keeps your EPUB archive organized.
Additional Folders
In addition to a Text folder, having the following additional folders can help you keep track of your book’s files:
- A Styles folder for CSS files to style your book
- An Images folder for your book’s images
- A Fonts folder for fonts you embed in your book
- An Audio folder for audio files
- A Video folder for video files
A Sample Chapter
The last thing your EPUB needs is chapters. Chapters are XHTML files. You should have one XHTML file for each chapter in the book. The following markup shows the shell of an XHTML file for a chapter:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops"
xml:lang="en">
<head>
<title>Chapter Title</title>
</head>
<body>
</body>
</html>
The contents of the chapter go between the <body> and </body> tags.
Additional Reading
Elizabeth Castro’s book, EPUB Straight to the Point, has a chapter on the EPUB file format that I found helpful.
Liza Daly wrote two articles on IBM’s developer site on EPUB that may help you learn about the EPUB file format.
- Build a digital book with EPUB
- Create rich-layout publications in EPUB 3 with HTML5, CSS3, and MathML
If you prefer video, Apple has two WWDC videos on EPUB.