
What is setext
----------------

  The following is extracted from text written by Ian Feldman.

  As originally explained in TidBITS#100 and mentioned there from
  now on, that publication now comes "wrapped as a setext." The noun
  itself stands for both a method to wrap (format) texts according
  to specific layout rules and for a single _structure_enhanced_
  text. The latter is a text which has been formatted in such a
  fashion that it contains clues as to the typographical and logical
  structure of its source (word-processed) document(s), if any.
  Those clues, which are called "typotags," facilitate later automatic
  detection of that structure so it can be validated, extracted,
  processed, transformed, enhanced as needed, if needed.

  It follows that setexts, being nothing but pure text (albeit with a
  special layout), are eminently readable using ANY editor or word
  processor in existence today or tomorrow, on any computer with a
  computer program that is capable of opening and reading text files.
  By default all properly setext-ized files will have an ".etx" or
  ".ETX" suffix. This stands for an "emailable/enhanced text", the
  ExtraTerrestrial overtones notwithstanding ;-))

  Unlike other forms of text encoding that use explicit, visible tag
  elements such as <this> and </that>, the setext format relies
  solely on the presence of _implicit_ typotags, carefully chosen
  to be as visually unobtrusive as possible. The underlined word
  above is one such instance of the de facto "invisible" coding.
  Inserted typotags will at worst appear as mere "typos" in the text.

  [Extensions made to the original set of typotags have muddied this
  clarity a little bit, but they were necessary for NEdit development.]

  Similarly, just to give an example, here is a short description
  of the four types of word emphasis typotags that setexts MAY
  contain, limited to one emphasis type ONLY per word or word group:

-------------------  ----------------------------  --------------
!    **aBoldWord**   **multiple bold words**       ; bold-tt
!_anUnderlinedWord_  _multiple_underlined_words_   ; underline-tt
!   ~anItalicWord~   ~multiple italicized words~   ; italic-tt
!        aHotWord_   multiple_hot_words_           ; hot-tt
-----------------------------------------------------------------

What makes a setext?
---------------------

  Before any decoding can take place, a text first has to be
  verified as a setext and not some arbitrarily-wrapped
  stream of characters. Although there are more ways than one to
  achieve that goal, there is one _primary_ test that has to be
  passed with flying colors or else the text being tested cannot
  be a setext.

  Chief among the typotags are two that signal presence of setext
  titles and subheads inside the text. A setext document can be
  formatted more or less properly, may contain or lack any other of
  its "native" elements, but it has to have at least one proper
  subhead or a title in order to be declared as "a certified
  setext."

Column 1 of text line
|
V
Here are a few demo setext subheads:
------------------------------------

_ _ _ _ Which Share Just One _ _ _ _
------------------------------------

---------->         UnifyinG FeaturE
------------------------------------

of EQUAL RIGHTMOST VISIBLE character
------------------------------------

length  as that of  its subhead-tt's
------------------------------------

[this line is called subhead-string]
------------------------------------

[the one below is called subhead-tt]
------------------------------------

[together they make a valid subhead]
------------------------------------

(!)    and of course, subheads do not have to be of the same length ;-)
-----------------------------------------------------------------------

  (nor have to begin in column 1)
---------------------------------

 although it is recommended that they stay below 40 characters
--------------------------------------------------------------

    Second Setext In This File
==============================

      ((end of examples))
      -------------------
      ((_not_ a subhead))
^
|
Column 1 of text line

  Note, the last 3 lines of the examples do not constitute a valid
  subhead because they do not start in column 1.

  Chief among the reasons why one should first look for presence of
  subheads rather than titles is that it is fully conceivable that a
  setext might have been created without an explicit title-tt in
  order to allow decoder programs to distinguish between part one
  and any subsequent ones in a possible multi-part mailing. This
  absence of a title-tt could be enough of a signal to start looking
  for a possible "part x of y" message in either the subject line,
  filename or anywhere "above" the first detected subhead of the
  current text.

  Therefore, here's a formal definition of what makes a setext:

  +-------------------------------------------------------------+
  | a text that contains at least one verified setext subhead   |
  | or setext title                                             |
  +-------------------------------------------------------------+

Other considerations
---------------------

  A possibility arises to keep the paragraph text unwrapped, rather
  than folded uniformly at say the 66th character mark. After all,
  if the setext is primarily to be displayed inside an editor,
  rather than on an 80 character terminal screen, then there is not
  much sense in prior folding of the lines to a specific
  guaranteed-to-fit-on-a-TTY-screen length.
  The editor/word processor program will fit the unwrapped text to
  the available display area, and might actually prefer to have to
  deal with whole unwrapped paragraphs rather than with otherwise
  relatively short lines.

  Most text-processing programs with native word-wrap capabilities
  actually consider return-terminated lines to be paragraphs in
  their own right. Thus, if a setext is not to travel via email
  anyway (because of it being distributed differently or making use
  of accented characters) then it might as well arrive in unfolded
  state so that no extra time need be spent on making the
  paragraphs "whole again." [This is not the choice that is taken
  with NEdit help because it is easier to visualize the final text
  for those who do not use text wrapping.]

  Observe that it is not the state of the paragraph text that makes
  or breaks a setext. No, the sole criterion of whether a text is
  a setext is the presence of at least one verified subhead, as
  described above. Thus even texts with unfolded paragraphs are
  setexts if they contain at least one subhead-tt.

  The sole mechanism used in setext to encode which of such lines
  are in reality paragraphs (as opposed to those that shouldn't be
  folded mechanically) is the character indent. In fact, after the
  subhead-tt the second most important typotag is the indent-tt,
  made up of exactly two space characters, which denotes any such
  indented lines as ready-candidates for reflowing by so inclined
  front-ends (either on their own or as part of like-indented lines
  above and below it). So any potentially long line of a setext
  that has been indent-tted will be understood (by any validated
  setext front-end) as ready for wrapping-to-length if so
  required.

.. All the following document by Steven Haehn

Typotags Available
------------------

  The following table contains typotags recognized by the setext
  utility. The "setext form" column in the table is formatted such
  that the leftmost character of the column represents the first
  character in a line of setext. The circumflex character (^) means
  that the characters of the typotag are significant only when they
  are anchored to the front of the setext line. Typotags marked
  with an asterisk (*) are extensions added for NEdit help
  generation.

!!  ============  ===================  ==================
!!  name of       setext form          acted upon or
!!  the typotag   of typotag           displayed as
!!  ============  ===================  ==================
!!  title-tt      "Title               a title
!!                 ====="              in chosen style
!!  ------------  -------------------  ------------------
!!  subhead-tt    "Subhead             a subhead
!!                 -------"            in chosen style
!!  ------------  -------------------  ------------------
!!  section-tt    ^#> section-text     a section heading
!!                                     with '#' from 1..9
!!                                     in chosen style
!!  ------------  -------------------  ------------------
!!  indent-tt     ^  lines indented    lines undented
!!                ^  by 2 spaces       and unfolded
!!  ------------  -------------------  ------------------
!!  bold-tt       **[multi]word**      1+ bold word(s)
!!  italic-tt     ~multi word~         1+ italic word(s)
!!  underline-tt  [_multi]_word_       underlined text
!!  hot-tt        [multi_]word_        1+ hot word(s)
!!  quote-tt      ^>[space][text]      > [mono-spaced]
!!  bullet-tt     ^*[space][text]      [bullet] [text]
!!  untouch-tt    `_quoted typotag!_`  `_left alone!_`
!!  notouch-tt*   ^!followed by text   text-left-alone
!!  field-tt*     |>name[=value]<|     value of name
!!  line-tt*      ^   ---              horizontal rule
!!  ------------  -------------------  ------------------
!!  href-tt*      ^.. _word URL        jump to address
!!  note-tt       ^.. _word Note:("*") ("cause error")
!!  target-tt*    _[multi_]word        [multi ]word
!!  ------------  -------------------  ------------------
!!  twobuck-tt    $$ [last on a line]  [parse another]
!!  suppress-tt   ^..[space][not dot]  [line hidden]
!!  twodot-tt     ^..[alone on a line] [taken note of]
!!  ------------  -------------------  ------------------
!!  maybe-tt*     ^.. ? name[~] text   show text when
!!                                     name defined
!!  maybenot-tt*  ^.. ! name[~] text   show text when
!!                                     name NOT defined
!!  endmaybe-tt*  ^.. ~ name           end of a multi-
!!                                     line maybe[not]-tt
!!  ------------  -------------------  ------------------
!!  passthru-tt*  ^!![text]            text emitted
!!                                     without processing
!!  ------------  -------------------  ------------------
!!  escape-tt*    @x where 'x' is      x is what remains
!!                escaped character    @@ needed for 1 @
!!  ============  ===================  ==================
!!

  The title-tt, subhead-tt and indent-tt have already been
  discussed at length in the previous sections. All typotag
  elements except the subhead-tt are optional, that is, not
  necessary for a setext to be declared as such. The simple
  character marking typotags, bold-tt, italic-tt, and underline-tt
  have been used throughout the document and are used to mark text
  with their obvious meanings.

3>Section-tt (document divisions)

  The section-tt allows subdividing of the setext into further
  subsections for greater nesting capability. Typical usage starts
  the numbering level at 3 because the title-tt and subhead-tt
  basically represent sections 1 and 2, respectively.

3>Bullet-tt (list marker)

  The bullet-tt typotag is used to create a list of items. Note that
  it can only be used to create single line entries, like the
  following:

Column 1 of text line
|
V
* This is the first bullet.
* This is the second bullet.

  Remember that you have to insert empty lines immediately before
  and after the bullet list.

3>Untouch-tt, Notouch-tt, Passthru-tt, Escape-tt (quoting text)

  Each one of these leave-my-text-alone typotags offers a varying
  degree of operation. The untouch-tt surrounds the text that
  is not to be interpreted. The accent grave (`) character is
  used to start and finish the untouchable text. (An extension
  to this has been allowed in the setext utility. An untouch-tt
  may be terminated by an apostrophe (').) The following are
  all valid untouch-tt typotags.

     `this is the _original_ version of the untouch-tt`
     `this is the _extended_ form of the untouch-tt'
     `This couldn't _be_ a problem could it?'

  Note that the third example has used the contraction "couldn't"
  which did not terminate the untouch-tt because the apostrophe was
  not followed by whitespace or punctuation.

  The notouch-tt typotag is used to take care of entire lines of
  text. The difference between this and the untouch-tt is that there
  is no visual residual typotag mark left in the output. It is
  replaced by a space. For example,

Column 1 of text line
|
V
! This line of text will look like this sans the ! in column 1.

  becomes,

  This line of text will look like this sans the ! in column 1.

  The difference between the passthru-tt and the notouch-tt is
  subtle: the passthru-tt markers are not replaced with a space
  but removed entirely. (The original usage was to try to emit
  special 'C' compiler directives directly into the help code
  product). Thus,

Column 1 of text line
|
V
!!#ifdef VMS

  would turn into

#ifdef VMS

  The escape-tt (@) is used to escape the special markers of
  the other typotags and itself. Here is an example of escaping
  itself.

     develop@@nedit.org

  This will become "develop@nedit.org" in resulting documents.


3>Suppress-tt, Twodot-tt (author annotations or comments)

  The suppress-tt typotag allows an author to place annotations in a
  setext document which will not appear in a generated product. Most
  of the extensions to the original setext definition were placed
  inside this form of typotag.

Column 1 of text line
|
V
.. This is a document comment that would normally disappear
.. from generated text, html, or the like. These lines are
.. what constitute a suppress-tt. The following line is the
.. twodot-tt.
..

3>Hot-tt, Href-tt, Target-tt (hyperlinking text)

  These three typotags are used in conjunction to create the
  hypertext reference mechanism used in HTML and NEdit
  help code generation. The hot-tt is an original typotag which
  needed the additional two tags to be able to create actual
  hyperlinks to other sections of the document, or to external
  references that could be exploited. These tags are ignored
  (stripped) when generating simple text documents.

  The hot-tt typotag is used to mark the text which would be used as
  the doorway to accessing other parts of the document. It either
  references a title or subhead string directly, or an href-tt. An
  href-tt (hypertext reference typotag) is used as an intermediary
  for the hyperlink destination. Its value either specifies an
  external document reference, or an internal document reference.
  The target-tt is used to mark the internal document references
  mentioned in an href-tt.

  Now for some examples. All the marked text will be inside
  parentheses so it will stand out as to what explicitly is being
  marked.

  This hot-tt directly references the (Typotags_Available_)
  subheading above. Whereas, the following hot-tt (references_)
  the href-tt marked by this target-tt (_typotag).

  Here is what the href-tt would look like:

Column 1 of text line
|
V
! .. _references #typotag

.. The following line is the actual hypertext reference in this
.. document. This annotation is an example of suppress-tt usage.
.. _references #typotag

3>Maybe[not]-tt, Endmaybe-tt (conditional text regions)

  Multiple line maybe-tt or maybenot-tt (conditional text regions)
  are introduced as follows:

Column 1 of text line
|
V
.. ? name~ (this is the maybe-tt)
.. ! name~ (this is the maybenot-tt)

  Both are terminated with an endmaybe-tt on a separate line.

Column 1 of text line
|
V
.. ~ name

  The name* of the conditional region is left up to the text
  author. Single line maybe[not]-tt typotags do not use the '~'
  character at the end of the name and are terminated at the end
  of the line.

Column 1 of text line
|
V
.. ? oneLine (This is a one line maybe-tt)
.. ! oneLine (This is a one line maybenot-tt)

  * There are some predefined conditional region names that are
  already known to the setext parser: html, text, and (NEdit) help.
  The special conditional text region named "html" allows a mixture
  of setext and HTML tags.

  Nesting of conditional text regions is allowed. For instance, if
  there are three conditional regions, A, B, and C, C can be nested
  inside B, which can be nested inside A. For example,
  A-B-C...C-B-A.

Column 1 of text line
|
V
.. ? A~ Example of legally nested conditional text regions
.. ? B~
.. ? C~
.. ~ C
.. ~ B
.. ~ A

  Note that a surrounding region cannot end before one of its inner
  regions is terminated (for example, the illegal nesting
  A-B-C...C-A-B, where A is terminated prior to B).

3>Field-tt (variable definition and substitution)

  Field-tt typotags are used to define variables and reference
  their values.
  Field definitions can only occur within a suppress-tt.

  For example, to define the variable 'author' and fill it
  with the value "Steven Haehn":

Column 1 of text line
|
V
.. |>author=Steven Haehn<|

  To use the value of the defined variable, place the field-tt,
  |>author<|, in any printable text region. If there is no known
  value for the field, it will remain unchanged and appear as
  written in the setext.

  The following are predefined for use in a field-tt
  for any setext document translated by the setext utility.

     Date = <MonthName day, year>          (e.g. December 6, 2001)
     date = <MonthAbbreviation day, year>  (e.g. Dec 6, 2001)
     year = <year>                         (e.g. 2001)

3>Line-tt (horizontal rule demarcation)

  This typotag is used to place horizontal markers into generated
  text documents, like the following.

   Column 4 of text line
   |
   V
   -------------------------------------------------------------

3>Twobuck-tt (setext termination marker)

  This typotag is used to mark the end of document parsing.

$$

$Id: setext-info.txt,v 1.3 2002/09/26 12:37:38 ajhood Exp $
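
  The field-tt and twobuck-tt rules described above can be sketched
  in a few lines of Python. This is only an illustration, not the
  setext utility's own code: the function name expand_fields, both
  regular expressions, and the choice to drop every ".. " line
  (ignoring href-tt and conditional regions) are assumptions made
  for this example. A twobuck-tt is also assumed to stand alone on
  its line.

```python
import re

# Assumed syntax, taken from the examples above: a definition sits on a
# suppress-tt line (".. |>name=value<|"); a reference is "|>name<|".
FIELD_DEF = re.compile(r"^\.\. \|>(\w+)=(.*?)<\|\s*$")
FIELD_REF = re.compile(r"\|>(\w+)<\|")


def expand_fields(lines, predefined=None):
    """Return the printable lines with known field-tt references filled in."""
    fields = dict(predefined or {})
    printable = []
    for line in lines:
        if line.strip() == "$$":
            break  # twobuck-tt: stop parsing here
        definition = FIELD_DEF.match(line)
        if definition:
            fields[definition.group(1)] = definition.group(2)
            continue
        if line.startswith(".. ") or line.rstrip() == "..":
            continue  # suppress-tt or twodot-tt: never printed
        # Unknown names are left exactly as written in the setext.
        printable.append(
            FIELD_REF.sub(lambda m: fields.get(m.group(1), m.group(0)), line)
        )
    return printable


sample = [
    ".. |>author=Steven Haehn<|",
    "  Written by |>author<| for |>project<|.",
    "$$",
    "  This line follows the twobuck-tt and is never parsed.",
]
for text in expand_fields(sample, predefined={"year": "2001"}):
    print(text)
```

  In the sample, the author field is replaced by its defined value,
  while the undefined project field stays as written, matching the
  behavior described in the Field-tt section.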