Words, words, words
Authors and writers of all stripes can learn a lot about creating and managing words from computer programmers, beginning with an appreciation for the simple, durable efficiencies of plain text. Anybody running Unix, Linux, or BSD already knows all about text, because it’s the third prong of the Unix Tools Philosophy:
- Write programs that do one thing and do it well;
- Write programs that work together;
- Write programs to handle text streams, because that is a universal interface.
Other writers discover the efficiencies of a good text editor by chance, because they dabble in writing code, even if it’s just HTML, or they need to manage gigantic, book-length text files, without the clunky overhead of a word processor (like Microsoft Word or iWork’s Pages or OpenOffice Writer).
Unlike word processors, text editors are fast and capable of opening and editing multiple gigabyte-sized files that would cause mere word processors to choke. True geeks prefer plain text for many reasons, some of which are nicely summarized by the folks at the 43 Folders wiki.1 Popular text editors include Vim, Emacs, UltraEdit, TextPad, NoteTab, TextEdit, TextMate, and BBEdit, to name only a handful. Wikipedia has a much longer list.
In the Beginning Were the Words
Writers and authors who don’t use a text editor and instead use mainly word processing programs may wonder: What’s plain text, how is it a “universal interface,” and why do I care?
Plain text is unformatted characters that are program independent and require little processing. Text means words, sentences, paragraphs and, yes, computer code. It’s called plain when it’s stored as unformatted, unadorned text characters in a plain text file, which sometimes has a .txt extension on the filename. A file named readme.txt probably contains plain text; if you double-click on the file icon it might even open in your text editor (probably Notepad or WordPad if you are on Windows and have not yet downloaded a real text editor).
Plain text does not light up, blink, or spontaneously create hyperlinks to Microsoft’s Bing search engine if you happen to type in a proper noun. You can’t insert your favorite YouTube video or MSNBC news item into a plain text file, although it’s easy enough to copy in the hyperlink.
Plain text means words separated by spaces; sentences separated by periods; paragraphs usually separated by single blank lines. If you are in the writing business, even the publishing or screenwriting business, it’s often all you need.
To the PC user raised on word processors, these spartan virtues sound like deficits, that is, until you want to access the text in your file using a program different than the one you used to create it. Open a Microsoft Word file in a simple text editor (like Notepad, or TextMate, or TextEdit, or BBEdit or UltraEdit), and you’ll see gobbledygook, not words.
Open a plain text file with almost any program, including Microsoft Word, Corel WordPerfect, Apple iWork or any of the hundreds of text editors and word processors on any computer, and you will be able to view and edit that text, just as you could have viewed and edited it twenty or thirty years ago, just as you’ll probably be able to view and edit it twenty or thirty years from now, whether Microsoft still exists or not.
The geeks who made Unix nearly 40 years ago made plain text the universal interface because they believed in economy, simplicity, and reliability. Instead of making big, complicated, bloated programs that tried to do everything (“It looks like you are writing a suicide note! Enter your zipcode or area code, and we’ll show you any local laws you may have to comply with before offing yourself.”), Unix programmers prided themselves on making small, well-designed, text-oriented programs or tools that each did one job well. One program found your files, another could open them, another could pipe the text back and forth between programs, another could count the words in a file, another could search files for matching strings of text, and so on. These programs accepted plain text as input, and produced plain text as output. Programming ingenuity meant discovering new ways to combine tools to accomplish a given task, then pass the results along (in plain text) to the next program, which could also capture, process, and produce more plain text, until you ended up with the results you sought.
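For writers who have never chained tools together, here is a minimal sketch of the idea in Python rather than in the Unix shell. The function names grep and count_words are my own stand-ins for the real Unix utilities, and the sample lines are a hypothetical substitute for a file on disk. Each function takes lines of plain text in and hands plain text (or a number) out, so one can feed the next.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Pass along only the lines that match the pattern (a stand-in for grep)."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-separated words (a stand-in for wc -w)."""
    return sum(len(line.split()) for line in lines)


# Hypothetical sample text standing in for a plain text file.
text = [
    "Call me Ishmael.",
    "Some years ago, never mind how long precisely,",
    "I thought I would sail about a little.",
]

# Chain the tools: filter the lines first, then count the words that survive.
matching = list(grep(r"\bI\b", text))
print(matching)
print(count_words(matching))
```

Neither function knows anything about the other. They cooperate only because both speak lines of plain text, which is the whole point of the universal interface.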
During the Unix era, only an idiot would have proposed creating programs that couldn’t talk to other programs. Why would anyone create files that can be edited and viewed only by the program that created them? Say, Adobe InDesign, or Microsoft Word? To Unix programmers and computer scientists the whole point was to make another tool for the Unix toolbox, then share your work with others, who in turn did likewise, and gradually Unix grew into the perfect computer geek workbench, a collection of small, efficient programs sharing a common file format and universal interface: plain text. As novelist and Uber-geek Neal Stephenson2 put it in his manifesto, In The Beginning Was The Command Line:
Unix … is not so much a product as it is a painstakingly compiled oral history of the hacker subculture. It is our Gilgamesh epic . . . . What made old epics like Gilgamesh so powerful and so long-lived was that they were living bodies of narrative that many people knew by heart, and told over and over again—making their own personal embellishments whenever it struck their fancy. The bad embellishments were shouted down, the good ones picked up by others, polished, improved, and, over time, incorporated into the story. Likewise, Unix is known, loved, and understood by so many hackers that it can be re-created from scratch whenever someone needs it. This is very difficult to understand for people who are accustomed to thinking of OSes as things that absolutely have to be bought.
If Unix is the geek Gilgamesh epic, it’s a tale told in plain text. On a Unix or Linux command line, “cat readme.txt” will print the contents of readme.txt to the screen. From a Windows command line, entering the command “TYPE readme.txt” will do the same. However, if readme.doc is a Microsoft Word document, issuing the command “TYPE readme.doc” will produce a string of illegible symbols, because readme.doc is stored in a proprietary format, in this case, a Microsoft Word file.
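To see why one file prints as words and the other as symbols, consider this rough Python sketch. The is_plain_text function is my own heuristic, not something cat or TYPE actually runs: it simply asks whether the bytes decode cleanly as text. The eight bytes in legacy_doc_header are the signature that opens the binary container used by legacy Word .doc files.

```python
def is_plain_text(data: bytes) -> bool:
    """Heuristic: True if the bytes decode as UTF-8 and contain no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# What a text editor writes for the words "Hello World!" plus a newline.
plain = "Hello World!\n".encode("ascii")

# The first eight bytes of a legacy .doc file (the compound file signature).
legacy_doc_header = bytes.fromhex("d0cf11e0a1b11ae1")

print(is_plain_text(plain))
print(is_plain_text(legacy_doc_header))
```

The plain text bytes pass the check and the Word header fails it. Those undecodable bytes are what show up as illegible symbols when a .doc file is dumped to the screen.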
Okay, so who cares? Most of us own a license to use Microsoft Word (on one machine, for a certain length of time), or else we can download various readers provided by Microsoft to read Word document files even if we don’t have a big honking Microsoft Word program on our computer. That’s true, for today, anyway.
But what about ten years from now? What about forty years from now? If the past is any guide, when 2019 or 2029 comes around, you will not be able to open, read, and edit a Microsoft Word file that you created in 2012 and left in some remote sector of your capacious hard drive. Why? Because programs change. Companies that make proprietary programs come and go. Yes, even monster companies with the lion’s share of the word processing market. Just ask any customer of Wang Laboratories (the ruling vendor of word processors during the 1980s). Even if the company still exists, they are in the business of selling newer, bigger, more complicated, more sophisticated, and more expensive programs every other year or so. Those newer, “better,” programs come with newer, proprietary file formats, to keep you purchasing those updates.
Tales of Woe from the Elder Geeks
It takes geeks and especially geek writers of a certain age to bring home the hazards of storing information in proprietary file formats. Consider first the Seer of the Singularity himself, Ray Kurzweil, as he looks back over almost forty years of his love affair with technology and the data formats he has accumulated along the way.
In a plaintive, downright sad passage of The Singularity Is Near, his otherwise upbeat take on the future of technology, Kurzweil includes a subsection called “The Longevity of Information,” which asks: How do you access the data stored on a circa-1960 IBM tape drive or a circa-1973 Data General Nova I?
First, Kurzweil explains, you need to find the old equipment and hope it still works. Then you need software and an operating system to run it. Are those still around somewhere? And what about tech support? Hah! You can’t get a help desk worker to call you back about the latest glitch running Microsoft Office, much less a program from forty years ago. “Even at the Computer History Museum most of the devices on display stopped functioning many years ago.”3
Kurzweil uses his own archival horror stories as “a microcosm of the exponentially expanding knowledge base that human civilization is accumulating,” then asks the terrible question: What if we are “writing” all of this knowledge in disappearing ink? The upshot of Kurzweil’s elegy to lost data is: “Information lasts only so long as someone cares about it.”
Do you care about your writings? The first order of business is to back up! As Kurzweil sees it, the only way data will remain alive and accessible “is if it is continually upgraded and ported to the latest hardware and software standards.” That’s one way to do it. Another way is to try to use formats that don’t go out of style. I have a 1983 Kaypro computer down in the basement that still works. All of the files I created with text editors or converted to plain text are still legible and formatted just as I left them. Here in 2012, almost 30 years after I created them, I can open them in a different text editor and work on them. The files I created using WordStar, a proprietary word processor, are lost.
Consider another elegy to a lost file, reprinted here with permission from Neal Stephenson:
I began using Microsoft Word as soon as the first version was released around 1985. After some initial hassles I found it to be a better tool than MacWrite, which was its only competition at the time. I wrote a lot of stuff in early versions of Word, storing it all on floppies, and transferred the contents of all my floppies to my first hard drive, which I acquired around 1987. As new versions of Word came out I faithfully upgraded, reasoning that as a writer it made sense for me to spend a certain amount of money on tools.
Sometime in the mid-1990’s I attempted to open one of my old, circa-1985 Word documents using the version of Word then current: 6.0. It didn’t work. Word 6.0 did not recognize a document created by an earlier version of itself. By opening it as a text file, I was able to recover the sequences of letters that made up the text of the document. My words were still there. But the formatting had been run through a log chipper–the words I’d written were interrupted by spates of empty rectangular boxes and gibberish.
Now, in the context of a business (the chief market for Word) this sort of thing is only an annoyance–one of the routine hassles that go along with using computers. It’s easy to buy little file converter programs that will take care of this problem. But if you are a writer whose career is words, whose professional identity is a corpus of written documents, this kind of thing is extremely disquieting. There are very few fixed assumptions in my line of work, but one of them is that once you have written a word, it is written, and cannot be unwritten. The ink stains the paper, the chisel cuts the stone, the stylus marks the clay, and something has irrevocably happened (my brother-in-law is a theologian who reads 3,250-year-old cuneiform tablets–he can recognize the handwriting of particular scribes, and identify them by name). But word-processing software–particularly the sort that employs special, complex file formats–has the eldritch power to unwrite things. A small change in file formats, or a few twiddled bits, and months’ or years’ literary output can cease to exist.
Now this was technically a fault in the application (Word 6.0 for the Macintosh) not the operating system (MacOS 7 point something) and so the initial target of my annoyance was the people who were responsible for Word. But. On the other hand, I could have chosen the “save as text” option in Word and saved all of my documents as simple telegrams, and this problem would not have arisen. Instead I had allowed myself to be seduced by all of those flashy formatting options that hadn’t even existed until GUIs had come along to make them practicable. I had gotten into the habit of using them to make my documents look pretty (perhaps prettier than they deserved to look; all of the old documents on those floppies turned out to be more or less crap). Now I was paying the price for that self-indulgence. Technology had moved on and found ways to make my documents look even prettier, and the consequence of it was that all old ugly documents had ceased to exist.4
Microsoft Word vs. Plain Text
If longevity of information isn’t high on your list, consider storage requirements. Open a text editor (Notepad if that’s all you have) and type two words: “Hello World!” then save the file and call it hello.txt. Now open a Microsoft Word document, type two words: “Hello World!” then save the document and call it hello.doc.
Now let’s compare the storage requirements for these two two-word files.
- hello.txt — ASCII plain text — 12 bytes;
- hello.doc — Microsoft Word — 19,968 bytes.
The two words “Hello World!” saved in a plain text file take up 12 bytes of storage space. Storing the same words in a Microsoft Word file requires 1,664 times as much disk space. If you want to see the kind of information that is embedded in a Word document file, read the article Binary versus ASCII (Plain Text) Files.
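If you would rather check the arithmetic than take my word for it, here is a short Python sketch. The 19,968-byte figure is simply the size reported above for hello.doc (your version of Word may produce a different number), and the variable names are my own.

```python
# Twelve characters, one byte each in ASCII: "Hello World!"
text_bytes = "Hello World!".encode("ascii")
txt_size = len(text_bytes)

# Size of hello.doc as reported in the comparison above (not measured here).
doc_size = 19_968

ratio = doc_size / txt_size
print(f"hello.txt: {txt_size} bytes")
print(f"hello.doc: {doc_size} bytes")
print(f"The Word file is {ratio:,.0f} times larger")
```

Every byte in hello.txt is one of the characters you typed. Everything beyond those 12 bytes in hello.doc is formatting and bookkeeping overhead.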
Not only is the two-word Microsoft Word document file hello.doc a monster, you also need a $200 word processing program to edit it properly. On my newish MacBook Pro, it still takes 45 seconds for the Microsoft Word beast to lumber into operation. By comparison, my favorite text editor Vim (or MacVim in my case) can open a file containing all of the text in one of my novels in less than three seconds. Not only is Vim fast and rock solid, but I can create files, from the tiny hello.txt to 150,000 word novels, which can be read with hundreds, nay, thousands of different, free programs on any kind of computer in the world.
Why should we care about how big the file is and whether you need special programs to read it? For starters, multiply our little file exercise by billions and trillions of files on hundreds of millions of computers all over the world. Electronic storage costs, at least at the corporate level, are soaring. Business email alone is estimated to be growing by 25-30% annually.5
Moore’s Law as applied to hard drives has lulled us into thinking that storage is not a problem, at least not on the home front. But just ask your IT officer or CIO how he or she feels about it. Hard drives are cheap, but secure, off-site, redundant back-ups of massive accumulations of email files bloated by Word files, music files, and video files cost each company millions of dollars each year. It’s called “data proliferation”6 and it’s bringing one corporation after another to its knees in the courts. Companies incur massive legal fines if they are unable to produce emails in litigation, so they err on the side of keeping everything. That policy results in huge electronic storage bills and an inability to find the needles in the data haystacks. These problems are all compounded by proprietary file formats. Not only are proprietary files monstrosities, but to find data in those files requires search and indexing programs capable of accessing dozens if not hundreds of different file formats, all created by different versions of dozens if not hundreds of different programs.
At some point we will have mandatory controls on CO2 emissions, and all of the power plants powering all of the data storage centers will be ripe targets. Is it time to rethink how and why we store gargantuan Microsoft Outlook .pst files for the sake of a few hundred emails that might be relevant to a future lawsuit? Are you beginning to think that those wise men who brought us the Unix Tools Philosophy and its adamant insistence on TEXT as the universal interface were onto something? They were: Plain text—universally readable since the days of Unix in 1970, and still universally readable, using free programs, probably forever.
The data storage crisis is complex and can’t be solved by converting fat PowerPoint files to text files, but let’s go back to our own laptops where this little experiment in plain text began. File size and electronic storage are not a problem at home … yet. The founding fathers of Unix did not glorify plain text only because they were worried about storage costs. No, they called plain text “universal” because it’s so easy to read, scan, search, access, pipe back and forth, share. Now. Forty years ago, and forty years from now.
Plain text it is! And in true Unix fashion, the best tools for creating and managing text (text editors and file search programs) are often not the same as the best tools for presenting text for the consumption of others (word processors, LaTeX, and other document preparation programs). I hope to post two more articles: One on text editors and Unix file utilities; and another on what might be called document presentation programs: Microsoft Word, LaTeX, Final Draft and Movie Magic Screenwriter. Programs like Highland and pandoc, Markdown and Fountain, and other mark-up and conversion systems allow writers and authors to type text once, and then convert it as needed for the Internet, for print, for e-book, screenplay, or manuscript format.
(Excerpted from Rapture For The Geeks: When AI Outsmarts IQ, by Richard Dooling. This is the first of three articles I plan to write on plain text, text editors and other writing tools, including Fountain and Highland for screenwriters and Markdown for novelists.)
- One of the 43 Folders Life Hacks is to keep your to-do list and even your rolodex in plain text files, as opposed to configuring one of the dozens of to-do widgets and databases du jour that come and go, often with price tags. For Linux and Mac users, Michael Stutz, author of the popular Linux Cookbook, 2nd Ed. (No Starch Press: San Francisco, 2004), has an excellent HOWTO called CLI Magic: Command-line Contact Management. ↩
- Author of Snow Crash, Cryptonomicon, and the Baroque Cycle trilogy. ↩
- The Singularity Is Near, p. 327 ↩
- In The Beginning Was The Command Line. Available at Neal Stephenson’s website (reprinted here with permission from the author). ↩
- IBM Whitepaper, The Toxic Terabyte: How data-dumping threatens business efficiency. ↩
- Data proliferation. ↩