Plain text formats
The simplest way to store information in computer memory is as a single file with a plain text format.
Plain text files can be thought of as the lowest common denominator of storage formats; they might not be the most efficient or sophisticated solution, but we can be fairly certain that they will get the job done.
The basic conceptual structure of a plain text format is that the data are arranged in rows, with several values stored on each row.
It is common for there to be several rows of general information about the data set, or metadata, at the start of the file. This is often referred to as a file header.
A good example of a data set in a plain text format is the surface temperature data for the Pacific Pole of Inaccessibility (see Section 1.1). Figure 5.2 shows how we would normally see this sort of plain text file if we view it in a text editor or a web browser.
             VARIABLE : Mean TS from clear sky composite (kelvin)
             FILENAME : ISCCPMonthly_avg.nc
             FILEPATH : /usr/local/fer_data/data/
             SUBSET : 48 points (TIME)
             LONGITUDE: 123.8W(-123.8)
             LATITUDE : 48.8S
             123.8W
             23
 16-JAN-1994 00 /  1:  278.9
 16-FEB-1994 00 /  2:  280.0
 16-MAR-1994 00 /  3:  278.9
 16-APR-1994 00 /  4:  278.9
 16-MAY-1994 00 /  5:  277.8
 16-JUN-1994 00 /  6:  276.1
 …
This file has 8 lines of metadata at the start, followed by 48 lines of the core data values, with 2 values, a date and a temperature, on each row.
There are two main sub-types of plain text format, which differ in how separate values are identified within a row:
Delimited formats:
In a delimited format, values within a row are separated by a special character, or delimiter. For example, it is possible to view the file in Figure 5.2 as a delimited format, where each line after the header consists of two fields separated by a colon (the character ':' is the delimiter). Alternatively, if we used whitespace (one or more spaces or tabs) as the delimiter, there would be five fields; on the first data row these are 16-JAN-1994, 00, /, 1:, and 278.9.
Fixed-width formats:
In a fixed-width format, each value is allocated a fixed number of characters within every row. For example, it is possible to view the file in Figure 5.2 as a fixed-width format, where the first value uses the first 20 characters and the second value uses the next 8 characters. Alternatively, there are five values on each row using 12, 3, 2, 6, and 5 characters respectively.
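The two sub-types can be tried out on a single data row. The following is a minimal sketch in Python, assuming the row has the spacing shown in the raw memory display later in this section; the helper name `split_fixed_width` is our own and not part of any library.

```python
# One data row from the temperature file (28 characters, leading space included).
row = " 16-JAN-1994 00 /  1:  278.9"

# Delimited view 1: a colon is the delimiter, giving two fields.
colon_fields = row.split(":")

# Delimited view 2: any run of whitespace is the delimiter, giving five fields.
# str.split() with no argument splits on runs of spaces or tabs.
whitespace_fields = row.split()


def split_fixed_width(line, widths):
    """Cut a line into consecutive pieces with the given character widths."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields


# Fixed-width views: 20 + 8 characters, or 12 + 3 + 2 + 6 + 5 characters.
two_values = split_fixed_width(row, [20, 8])
five_values = split_fixed_width(row, [12, 3, 2, 6, 5])

print(colon_fields)
print(whitespace_fields)
print(two_values)
print(five_values)
```

Note that the same row yields different fields under each reading; nothing in the row itself says which reading is the intended one.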
At the lowest level, the primary characteristic of a plain text format is that all of the information in the file, even numeric information, is stored as text.
We will spend the next few sections at this lower level of detail because it will be helpful in understanding the advantages and disadvantages of plain text formats for storing data, and because it will help us to differentiate plain text formats from binary formats later on in Section 5.3.
The first things we need to establish are some fundamental ideas about computer memory.
The most fundamental unit of computer memory is the bit. A bit can be a tiny magnetic region on a hard disk, a tiny dent in the reflective material on a CD or DVD, or a tiny transistor on a memory stick. Whatever the physical implementation, the important thing to know about a bit is that, like a switch, it can only take one of two values: it is either “on” or “off”.
A collection of 8 bits is called a byte and (on the majority of computers today) a collection of 4 bytes, or 32 bits, is called a word.
A file is simply a block of computer memory.
A file can be as small as just a few bytes or it can be several gigabytes in size (thousands of millions of bytes).
A file format is a way of interpreting the bytes in a file. For example, in the simplest case, a plain text format means that each byte is used to represent a single character.
In order to visualize the idea of file formats, we will display a block of memory in the format shown below. This example shows the first 24 bytes from the PDF file for this book.
  0 : 00100101 01010000 01000100 01000110 | %PDF
  4 : 00101101 00110001 00101110 00110100 | -1.4
  8 : 00001010 00110101 00100000 00110000 | .5 0
 12 : 00100000 01101111 01100010 01101010 |  obj
 16 : 00001010 00111100 00111100 00100000 | .<< 
 20 : 00101111 01010011 00100000 00101111 | /S /
This display has three columns. On the left is a byte offset that indicates the memory location within the file for each row. The middle column displays the raw memory contents of the file, which is just a series of 0’s and 1’s. The right hand column displays an interpretation of the bytes. This display is split across several rows just so that it will fit onto the printed page. A block of computer memory is best thought of as one long line of 0’s and 1’s.
In this example, we are interpreting each byte of memory as a single character, so for each byte in the middle column, there is a corresponding character in the right-hand column. As specific examples, the first byte, 00100101, is being interpreted as the percent character, %, and the second byte, 01010000, is being interpreted as the letter P.
In some cases, the byte of memory does not correspond to a printable character, and in those cases we just display a full stop. An example of this is byte number nine (the first byte on the third row of the display).
Because the binary code for computer memory takes up so much space, we will also sometimes display the central raw memory column using hexadecimal (base 16) code rather than binary. In this case, each byte of memory is just a pair of hexadecimal digits. The first 24 bytes of the PDF file for this book are shown again below, using hexadecimal code for the raw memory.
  0 : 25 50 44 46 2d 31 2e 34 0a 35 20 30 | %PDF-1.4.5 0
 12 : 20 6f 62 6a 0a 3c 3c 20 2f 53 20 2f |  obj.<< /S /
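A display of this kind is straightforward to produce. The sketch below is one possible way to do it in Python; the function name `dump_bytes` and its parameters are our own choices for illustration, and the 24 bytes are copied from the hexadecimal display above rather than read from an actual file.

```python
def dump_bytes(data, per_row=12, base="hex"):
    """Format bytes as rows of: offset : raw bytes | characters."""
    lines = []
    for offset in range(0, len(data), per_row):
        chunk = data[offset:offset + per_row]
        if base == "hex":
            raw = " ".join(f"{byte:02x}" for byte in chunk)
        else:
            raw = " ".join(f"{byte:08b}" for byte in chunk)
        # Bytes 32 to 126 are printable ASCII characters;
        # any other byte is displayed as a full stop.
        text = "".join(chr(byte) if 32 <= byte <= 126 else "." for byte in chunk)
        lines.append(f"{offset:>3} : {raw} | {text}")
    return lines


# The first 24 bytes of the PDF file, taken from the display above.
pdf_start = bytes.fromhex(
    "25 50 44 46 2d 31 2e 34 0a 35 20 30"
    "20 6f 62 6a 0a 3c 3c 20 2f 53 20 2f"
)

for line in dump_bytes(pdf_start):
    print(line)

for line in dump_bytes(pdf_start, per_row=4, base="bin"):
    print(line)
```

The same bytes are shown twice, first as pairs of hexadecimal digits and then as binary digits; only the way the raw memory is written down changes, not the memory itself.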
We will now look at a low level at the surface temperature data for the Pacific Pole of Inaccessibility (see Section 1.1), which is in a plain text format. To emphasize the format of this information in computer memory, the first 48 bytes of the file are displayed below. This display should be compared with Figure 5.2, which shows what we would normally see when we view the plain text file in a text editor or web browser.
  0 : 20 20 20 20 20 20 20 20 20 20 20 20 |             
 12 : 20 56 41 52 49 41 42 4c 45 20 3a 20 |  VARIABLE : 
 24 : 4d 65 61 6e 20 54 53 20 66 72 6f 6d | Mean TS from
 36 : 20 63 6c 65 61 72 20 73 6b 79 20 63 |  clear sky c
This display clearly demonstrates that the Point Nemo information has been stored as a series of characters. The empty space at the start of the first line is a series of 13 spaces, with each space stored as a byte with the hexadecimal value 20. The letter V at the start of the word VARIABLE has been stored as a byte with the value 56.
To further emphasize the character-based nature of a plain text format, another part of the file is shown below as raw computer memory, this time focusing on the part of the file that contains the core data–the dates and temperature values.
336 : 20 31 36 2d 4a 41 4e 2d 31 39 39 34 20 30 30 |  16-JAN-1994 00
351 : 20 2f 20 20 31 3a 20 20 32 37 38 2e 39 0d 0a |  /  1:  278.9..
366 : 20 31 36 2d 46 45 42 2d 31 39 39 34 20 30 30 |  16-FEB-1994 00
381 : 20 2f 20 20 32 3a 20 20 32 38 30 2e 30 0d 0a |  /  2:  280.0..
The second line of this display shows that the number 278.9 is stored in this file as five characters–the digits 2, 7, 8, followed by a full stop, then the digit 9–with one byte per character. Another small detail that may not have been clear from previous views of these data is that each line starts with a space, represented by a byte with the value 20.
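The claim that 278.9 occupies five bytes, one per character, can be checked directly. This is a small sketch, assuming the first data line is encoded as ASCII and ends with the two bytes 0d 0a seen in the display; the helper `as_hex` is our own.

```python
# The first data line of the file, including the two line-ending bytes.
line = " 16-JAN-1994 00 /  1:  278.9\r\n"
raw = line.encode("ascii")


def as_hex(data):
    """Show each byte as a pair of hexadecimal digits."""
    return " ".join(f"{byte:02x}" for byte in data)


# The temperature occupies characters 24 to 28 of the line.
temperature_bytes = raw[23:28]

print(as_hex(raw[:1]))             # the leading space
print(as_hex(temperature_bytes))   # the five characters of 278.9
print(as_hex(raw[-2:]))            # the end-of-line bytes

# The bytes only become a number once we convert the text explicitly.
value = float(temperature_bytes.decode("ascii"))
print(value)
```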
We will contrast this sort of format with other ways of storing the information later in Section 5.3. For now, we just need to be aware of how simply a plain text format uses memory: everything is stored as a series of characters.
The next section will look at why these features can be both a blessing and a curse.
The main advantage of plain text formats is their simplicity: we do not require complex software to create or view a text file and we do not need esoteric skills beyond being able to type on a keyboard, which means that it is easy for people to view and modify the data.
The simplicity of plain text formats means that virtually all software packages can read and write text files and plain text files are portable across different computer platforms.
The main disadvantage of plain text formats is also their simplicity. The basic conceptual structure of rows of values can be very inefficient and inappropriate for data sets with any sort of complex structure.
The low-level format of storing everything as characters, with one byte per character, can also be very inefficient in terms of the amount of computer memory required.
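To get a rough feel for the memory cost, we can compare the number of bytes needed to store an integer as text with the number needed for a fixed-size binary integer. This is only a sketch: the sample values are made up for illustration, and the 4-byte integer produced by Python's `struct` module stands in for the binary formats discussed in Section 5.3.

```python
import struct


def text_size(value):
    """Bytes needed to store an integer as text, one byte per character."""
    return len(str(value).encode("ascii"))


def int32_size(value):
    """Bytes needed to store an integer as a 32-bit binary value."""
    return len(struct.pack("<i", value))


# Hypothetical sample values, chosen only to show the trend.
for value in (7, 1994, 1234567890):
    print(value, text_size(value), int32_size(value))
```

For small values the text form is no larger than the binary form, but the text form grows by one byte with every extra digit, and in practice a delimiter or padding is needed as well.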
Consider a data set collected on two families, as depicted in Figure 5.3. What would this look like as a plain text file, with one row for all of the information about each person in the data set? One possible fixed-width format is shown below. In this format, each row records the information for one person. For each person, there is a column for the father’s name (if known), a column for the mother’s name (if known), the person’s own name, his or her age, and his or her gender.
                John       33 male
                Julia      32 female
John    Julia   Jack        6 male
John    Julia   Jill        4 female
John    Julia   John jnr    2 male
                David      45 male
                Debbie     42 female
David   Debbie  Donald     16 male
David   Debbie  Dianne     12 female
This format for storing these data is not ideal for two reasons. Firstly, it is not efficient; the parent information is repeated over and over again. This repetition is also undesirable because it creates opportunities for errors and inconsistencies to creep in. Ideally, each individual piece of information would be stored exactly once; if more than one copy exists, then it is possible for the copies to disagree. The DRY principle (Section 2.7) applies to data as well as code.
The second problem is not as obvious, but is arguably much more important. The fundamental structure of most plain text file formats means that each line of the file contains exactly one record or case in the data set. This works well when a data set only contains information about one type of object, or, put another way, when the data set itself has a “flat” structure.
The data set of family members does not have a flat structure. There is information about two different types of object, parents and children, and these objects have a definite relationship between them. We can say that the data set is hierarchical or multi-level or stratified (as is obvious from the view of the data in Figure 5.3). Any data set that is obtained using a non-trivial study design is likely to have a hierarchical structure like this.
In other words, a plain text file format does not allow for sophisticated data models. A plain text format is unable to provide an appropriate representation of a complex data structure. Later sections will provide examples of storage formats that are capable of storing complex data structures.
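To make the contrast concrete, the sketch below holds the family data twice in Python: once as flat rows, mirroring the fixed-width file, and once as a nested structure in which each person is stored exactly once. This is an in-memory illustration only, not one of the storage formats covered in later sections; the function `to_families` is our own and assumes that names are unique.

```python
# Flat view: one row per person, as in the fixed-width file.
# Columns: father, mother, name, age, gender (None means not recorded).
flat_rows = [
    (None, None, "John", 33, "male"),
    (None, None, "Julia", 32, "female"),
    ("John", "Julia", "Jack", 6, "male"),
    ("John", "Julia", "Jill", 4, "female"),
    ("John", "Julia", "John jnr", 2, "male"),
    (None, None, "David", 45, "male"),
    (None, None, "Debbie", 42, "female"),
    ("David", "Debbie", "Donald", 16, "male"),
    ("David", "Debbie", "Dianne", 12, "female"),
]


def to_families(rows):
    """Group flat rows into one record per family (assumes unique names)."""
    people = {name: {"name": name, "age": age, "gender": gender}
              for _, _, name, age, gender in rows}
    families = {}
    for father, mother, name, _, _ in rows:
        if father is None and mother is None:
            continue  # a parent; attached when a child refers to them
        parents = [people[p] for p in (father, mother) if p is not None]
        family = families.setdefault(
            (father, mother), {"parents": parents, "children": []})
        family["children"].append(people[name])
    return list(families.values())


families = to_families(flat_rows)

# In the flat rows the name Julia is written four times;
# in the nested structure her record exists once.
julia_in_flat_rows = sum(row.count("Julia") for row in flat_rows)
print(julia_in_flat_rows)
print(len(families), [len(f["children"]) for f in families])
```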
Another major weakness of free-form text files is the lack of information within the file itself about the structure of the file. For example, plain text files do not usually contain information about which special character is being used to separate fields in a delimited file, or any information about the widths of fields with a fixed-width format. This means that the computer cannot automatically determine where different fields are within each row of a plain text file, or even how many fields there are.
A fixed-width format avoids this problem, but enforcing a fixed length for fields can create other difficulties if we do not know the maximum possible length for all variables. Also, if the values for a variable can have very different lengths, a fixed-width format can be inefficient because we store lots of empty space for short values.
The simplicity of plain text files makes it easy for a computer to read a file as a series of characters, but the computer cannot easily distinguish individual data values from the series of characters. Even worse, the computer has no way of telling what sort of data is stored in each field. Does the series of characters represent a number, or text, or some more complex value such as a date?
In practice, a human must supply additional information about a plain text file before the computer can successfully determine where the different fields are within a plain text file and what sort of value is stored in each field.
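As an illustration of the kind of information that has to be supplied, the sketch below reads one row of the temperature data. Everything in `LAYOUT` is knowledge that comes from a person rather than from the file: which characters hold each value and how the text should be converted. The names `LAYOUT`, `parse_date`, and `parse_row` are our own, and the character positions assume the row spacing seen in the raw memory display earlier.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_date(text):
    """Turn text such as '16-JAN-1994' into a date value."""
    day, month, year = text.strip().split("-")
    return date(int(year), MONTHS.index(month) + 1, int(day))


# Supplied by a human: for each field, a name, the characters it
# occupies within the row, and how to convert those characters.
LAYOUT = [
    ("date", slice(0, 12), parse_date),
    ("temperature", slice(23, 28), float),
]


def parse_row(line, layout=LAYOUT):
    """Extract typed values from one row using the supplied layout."""
    return {name: convert(line[chars]) for name, chars, convert in layout}


record = parse_row(" 16-JAN-1994 00 /  1:  278.9")
print(record)
```

Without the layout, the program would only see 28 characters; with it, the same characters become a date and a number.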
MIME Type: text/plain
ID: text
The Text Plain format represents text as a string.
Note that DataWeave parses, encodes, and stores this format in memory (RAM).
This example shows how DataWeave represents Text Plain data.
The Plain Text data serves as the input payload to the DataWeave source.
This is text plain
The DataWeave script transforms the Text Plain input payload to the DataWeave (dw) format and MIME type. It returns a string.
output application/dw
---
payload
Because the DataWeave (dw) output is a string, it is wrapped in quotation marks.
"This is text plain"
DataWeave supports the following configuration properties for the Text Plain format.
There are no reader properties for Text Plain data.
The Text Plain format accepts properties that provide instructions for writing output data.
Parameter | Type | Default | Description |
---|---|---|---|
bufferSize | Number | 8192 | Size of the writer buffer. |
deferred | Boolean | false | When set to true, DataWeave generates the output as a data stream, and the script’s execution is deferred until it is consumed. Valid values are true or false. |
encoding | String | null | Encoding for the writer to use. |
The Text Plain format supports the following MIME types.
MIME Type |
---|
text/plain |