How To Transform HTML To Textile Markup - The CakePHP TextileHelper Revisited

Posted by Tim Koschützki, on Aug 23, 2007 - in PHP & CakePHP » Views & Helpers

Hi folks. For a current project of mine I had to find a way to convert HTML into Textile markup. Why? Because we are using TinyMCE to turn our textareas into WYSIWYG editors, which generate HTML. However, we want all output to go through Textile so that only the tags Textile supports are allowed. Yes, we could do it with strip_tags(), but Textile is much more elegant. Plus, it was a requirement by the client. Read on to find out how to "detextile" HTML.

How Does It Work?

Well, the code is not entirely trivial, but it looks like what you would expect: a bunch of regex processing. Here is the basic detextile() method, which takes some HTML text:

php
function detextile($text) {
    // Turn <br /> line breaks into newlines.
    $text = preg_replace("/<br \/>\s*/", "\n", $text);

    // Tags handed to processTag(); all other markup is left untouched.
    $oktags = array('p','ol','ul','li','i','b','em','strong','span','a','h[1-6]',
        'table','tr','td','u','del','sup','sub','blockquote');

    foreach ($oktags as $tag) {
        $text = preg_replace_callback("/\t*<(".$tag.")\s*([^>]*)>(.*)<\/\\1>/Usi",
            array($this, 'processTag'), $text);
    }

    $text = $this->detextile_process_glyphs($text);
    $text = $this->detextile_process_lists($text);

    // Drop the "p. " markers left at the start of lines.
    $text = preg_replace('/^\t* *p\. /m', '', $text);

    return $this->decode_high($text);
}

Okay, so we run processTag() on every HTML tag we want to cover, process glyphs (we will get to that in a minute) and lists, strip the leading tabs and "p. " paragraph markers, and return the text decoded to UTF-8 by decode_high(), which makes use of the mb_decode_numericentity() function. So what does processTag() do?

php
function processTag($matches) {
    list($all, $tag, $atts, $content) = $matches;
    $tag = strtolower($tag);  // the regex is case-insensitive, the maps are not
    $a = $this->splat($atts);

    // Phrase tags and their Textile markers.
    $phr = array(
        'em'     => '_',
        'i'      => '__',
        'b'      => '**',
        'strong' => '*',
        'cite'   => '??',
        'del'    => '-',
        'ins'    => '+',
        'sup'    => '^',
        'sub'    => '~',
        'span'   => '%');

    // Block tags that become a Textile block prefix such as "h1. ".
    $blk = array('p','h1','h2','h3','h4','h5','h6');

    if (isset($phr[$tag])) {
        return $phr[$tag].$this->sci($a).$content.$phr[$tag];
    } elseif ($tag == 'blockquote') {
        return 'bq.'.$this->sci($a).' '.$content;
    } elseif (in_array($tag, $blk)) {
        return $tag.$this->sci($a).'. '.$content;
    } elseif ($tag == 'a') {
        $t = $this->filterAtts($a, array('href','title'));
        $out = '"'.$content;
        $out .= (isset($t['title'])) ? ' ('.$t['title'].')' : '';
        $out .= '":'.$t['href'];
        return $out;
    } else {
        // Tags without a mapping (li, u, table, ...) are returned unchanged.
        return $all;
    }
}

// Builds the modifier string from the parsed class, id, style and cite attributes.
function sci($a)
{
    $out = '';
    foreach ($a as $t) {
        $out .= ($t['name'] == 'class') ? '(='.$t['att'].')' : '';
        $out .= ($t['name'] == 'id') ? '[='.$t['att'].']' : '';
        $out .= ($t['name'] == 'style') ? '{='.$t['att'].'}' : '';
        $out .= ($t['name'] == 'cite') ? ':'.$t['att'] : '';
    }
    return $out;
}

This is where most of the conversion takes place. A map translates HTML phrase tags into their Textile markers, block tags such as p and h1 to h6 become Textile block prefixes, and links become Textile's "text":url form. Classes, ids and style attributes are preserved by the splat() method, which parses the attribute string, and the sci() method, which turns the parsed attributes into modifiers. The splat() method is too long to explain here, but it should become clear when you look at it below.
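To make the callback idea concrete, here is a minimal standalone sketch of the phrase-tag branch. It is not the helper's code: the function names are made up for illustration, parse_attributes_sketch() is a much simpler stand-in for splat() that only understands double-quoted attributes, and only strong and em are handled.

```php
<?php
// Hypothetical, simplified stand-in for splat(): double-quoted attributes only.
function parse_attributes_sketch(string $atts): array
{
    preg_match_all('/([a-z]+)\s*=\s*"([^"]*)"/i', $atts, $matches, PREG_SET_ORDER);
    $parsed = [];
    foreach ($matches as $m) {
        $parsed[] = ['name' => strtolower($m[1]), 'att' => $m[2]];
    }
    return $parsed;
}

// Mirrors the output format of sci() for class, id and style.
function modifiers_sketch(array $attributes): string
{
    $wrap = ['class' => ['(=', ')'], 'id' => ['[=', ']'], 'style' => ['{=', '}']];
    $out = '';
    foreach ($attributes as $a) {
        if (isset($wrap[$a['name']])) {
            $out .= $wrap[$a['name']][0] . $a['att'] . $wrap[$a['name']][1];
        }
    }
    return $out;
}

// Converts <strong> and <em> in the same way as the $phr branch of processTag().
function phrase_tags_to_textile_sketch(string $html): string
{
    $markers = ['strong' => '*', 'em' => '_'];
    return preg_replace_callback(
        '/<(strong|em)\b([^>]*)>(.*?)<\/\1>/si',
        function (array $m) use ($markers) {
            $marker = $markers[strtolower($m[1])];
            return $marker . modifiers_sketch(parse_attributes_sketch($m[2])) . $m[3] . $marker;
        },
        $html
    );
}

echo phrase_tags_to_textile_sketch('<strong class="hot">bold</strong> and <em>italic</em>'), "\n";
// prints: *(=hot)bold* and _italic_
```

The regex captures the tag name, the raw attribute string and the content, which is the same three-part split that processTag() receives in $matches.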

Now on to the glyphs and the list methods:

php
function detextile_process_glyphs($text) {
    $glyphs = array(
        '&#8217;' => '\'',      # single closing
        '&#8216;' => '\'',      # single opening
        '&#8221;' => '"',       # double closing
        '&#8220;' => '"',       # double opening
        '&#8212;' => '--',      # em dash
        '&#8211;' => ' - ',     # en dash
        '&#215;'  => 'x',       # dimension sign
        '&#8482;' => '(TM)',    # trademark
        '&#174;'  => '(R)',     # registered
        '&#169;'  => '(C)',     # copyright
        '&#8230;' => '...'      # ellipsis
    );

    foreach ($glyphs as $f => $r) {
        $text = str_replace($f, $r, $text);
    }
    return $text;
}

Easy. It simply converts the HTML entities for some glyphs into their Textile equivalents.
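As a quick illustration, here is a standalone sketch of the same lookup with a shortened map. The function name is hypothetical, and passing the keys and values to a single str_replace() call is a stylistic variation rather than what the helper does.

```php
<?php
// Hypothetical standalone variant of detextile_process_glyphs() with a shortened map.
function glyphs_to_text_sketch(string $text): string
{
    $glyphs = [
        '&#8217;' => "'",   // single closing quote
        '&#8220;' => '"',   // double opening quote
        '&#8221;' => '"',   // double closing quote
        '&#8212;' => '--',  // em dash
        '&#8230;' => '...', // ellipsis
    ];
    // str_replace() accepts parallel arrays, so no explicit loop is needed.
    return str_replace(array_keys($glyphs), array_values($glyphs), $text);
}

echo glyphs_to_text_sketch('It&#8217;s &#8220;done&#8221;&#8230;'), "\n";
// prints: It's "done"...
```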

The list method:

php
function detextile_process_lists($text) {
    $list = false;          // false outside a list, "o" inside <ol>, "u" inside <ul>
    $glyph_out = array();

    // Split into tags and the text between them, keeping the tags.
    $text = preg_split("/(<.*>)/U", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($text as $line) {

        if ($list == false && preg_match('/<ol/', $line)) {
            $line = "";
            $list = "o";
        } else if (preg_match('/<\/ol/', $line)) {
            $line = "";
            $list = false;
        } else if ($list == false && preg_match('/<ul/', $line)) {
            $line = "";
            $list = "u";
        } else if (preg_match('/<\/ul/', $line)) {
            $line = "";
            $list = false;
        } else if ($list == 'o') {
            $line = preg_replace('/<li.*>/U', '# ', $line);
        } else if ($list == 'u') {
            $line = preg_replace('/<li.*>/U', '* ', $line);
        }
        $glyph_out[] = $line;
    }

    return implode('', $glyph_out);
}

This method is a bit trickier. It splits the text into tags and the text between them, wipes out the list tags themselves (ul, ol) and converts all li tags into their Textile equivalents: either "# " (for ordered lists) or "* " (for unordered lists).
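The method above keeps a single flag for the current list type, so it does not appear to handle lists nested inside lists. The following standalone sketch shows one possible way to extend the idea with a stack of markers. It is not the helper's code: the function name is hypothetical, and unlike the original it also turns closing li tags into newlines.

```php
<?php
// Hypothetical list converter; tracks nesting with a stack of markers.
function lists_to_textile_sketch(string $html): string
{
    $stack = [];  // one marker per open list: '#' for <ol>, '*' for <ul>
    $out = '';
    // Split into tags and the text between them, keeping the tags.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($tokens as $token) {
        if (preg_match('/^<(ol|ul)\b/i', $token, $m)) {
            $stack[] = (strtolower($m[1]) === 'ol') ? '#' : '*';
        } elseif (preg_match('/^<\/(ol|ul)\b/i', $token)) {
            array_pop($stack);
        } elseif ($stack && preg_match('/^<li\b/i', $token)) {
            // Start each item on its own line, prefixed with one marker per level.
            if ($out !== '' && substr($out, -1) !== "\n") {
                $out .= "\n";
            }
            $out .= implode('', $stack) . ' ';
        } elseif ($stack && preg_match('/^<\/li\b/i', $token)) {
            if (substr($out, -1) !== "\n") {
                $out .= "\n";
            }
        } else {
            $out .= $token;  // plain text, or a tag this sketch does not handle
        }
    }
    return $out;
}

echo lists_to_textile_sketch(
    '<ol><li>test<ol><li>item1</li><li>item2</li></ol></li><li>test2</li></ol>'
);
// prints:
// # test
// ## item1
// ## item2
// # test2
```

Whitespace between tags is passed through as-is, so heavily indented HTML would need trimming first.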

How Do You Use The Code?

Using the code is darn easy. You just invoke the detextile method upon your html code:

php
echo $textile->detextile($htmlText);

An Example

Here is some example html code we want to convert:

html
<p>
  <strong>This is some bold text</strong>
</p>
<p>
  <em>This is italic text</em>
</p>
<p>

</p>
<p>
  <u>Underline text man</u>
</p>
<p>

</p>
<ul>
  <li>ul list item1
  </li>
  <li>ul list item2
  </li>
  <li>ul list item3
  </li>
</ul>
<ol>
  <li>ol list item1
  </li>
  <li>ol list item2
  </li>
  <li>ol list item3
  </li>
</ol>

Detextile output:

html
*This is some bold text*p.  _This is italic text_p.  p. <u>Underline text man</u>p.  * ul list item1 * ul list item2* ul list item3# ol list item1# ol list item2# ol list
item3

Cool! And it was easy as well!

Get The Code

Here are all the methods for your CakePHP Textile helper. You can of course plug them into a Textile helper for any other framework as well:

php
// -------------------------------------------------------------
// The following functions are used to detextile html, a process
// still in development.
// By Tim Koschützki

// Based on code from http://www.aquarionics.com

// -------------------------------------------------------------
function detextile($text) {
    // Turn <br /> line breaks into newlines.
    $text = preg_replace("/<br \/>\s*/", "\n", $text);

    // Tags handed to processTag(); all other markup is left untouched.
    $oktags = array('p','ol','ul','li','i','b','em','strong','span','a','h[1-6]',
        'table','tr','td','u','del','sup','sub','blockquote');

    foreach ($oktags as $tag) {
        $text = preg_replace_callback("/\t*<(".$tag.")\s*([^>]*)>(.*)<\/\\1>/Usi",
            array($this, 'processTag'), $text);
    }

    $text = $this->detextile_process_glyphs($text);
    $text = $this->detextile_process_lists($text);

    // Drop the "p. " markers left at the start of lines.
    $text = preg_replace('/^\t* *p\. /m', '', $text);

    return $this->decode_high($text);
}

function detextile_process_glyphs($text) {
    $glyphs = array(
        '&#8217;' => '\'',      # single closing
        '&#8216;' => '\'',      # single opening
        '&#8221;' => '"',       # double closing
        '&#8220;' => '"',       # double opening
        '&#8212;' => '--',      # em dash
        '&#8211;' => ' - ',     # en dash
        '&#215;'  => 'x',       # dimension sign
        '&#8482;' => '(TM)',    # trademark
        '&#174;'  => '(R)',     # registered
        '&#169;'  => '(C)',     # copyright
        '&#8230;' => '...'      # ellipsis
    );

    foreach ($glyphs as $f => $r) {
        $text = str_replace($f, $r, $text);
    }
    return $text;
}

function detextile_process_lists($text) {
    $list = false;          // false outside a list, "o" inside <ol>, "u" inside <ul>
    $glyph_out = array();

    // Split into tags and the text between them, keeping the tags.
    $text = preg_split("/(<.*>)/U", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($text as $line) {

        if ($list == false && preg_match('/<ol/', $line)) {
            $line = "";
            $list = "o";
        } else if (preg_match('/<\/ol/', $line)) {
            $line = "";
            $list = false;
        } else if ($list == false && preg_match('/<ul/', $line)) {
            $line = "";
            $list = "u";
        } else if (preg_match('/<\/ul/', $line)) {
            $line = "";
            $list = false;
        } else if ($list == 'o') {
            $line = preg_replace('/<li.*>/U', '# ', $line);
        } else if ($list == 'u') {
            $line = preg_replace('/<li.*>/U', '* ', $line);
        }
        $glyph_out[] = $line;
    }

    return implode('', $glyph_out);
}

function processTag($matches) {
    list($all, $tag, $atts, $content) = $matches;
    $tag = strtolower($tag);  // the regex is case-insensitive, the maps are not
    $a = $this->splat($atts);

    // Phrase tags and their Textile markers.
    $phr = array(
        'em'     => '_',
        'i'      => '__',
        'b'      => '**',
        'strong' => '*',
        'cite'   => '??',
        'del'    => '-',
        'ins'    => '+',
        'sup'    => '^',
        'sub'    => '~',
        'span'   => '%');

    // Block tags that become a Textile block prefix such as "h1. ".
    $blk = array('p','h1','h2','h3','h4','h5','h6');

    if (isset($phr[$tag])) {
        return $phr[$tag].$this->sci($a).$content.$phr[$tag];
    } elseif ($tag == 'blockquote') {
        return 'bq.'.$this->sci($a).' '.$content;
    } elseif (in_array($tag, $blk)) {
        return $tag.$this->sci($a).'. '.$content;
    } elseif ($tag == 'a') {
        $t = $this->filterAtts($a, array('href','title'));
        $out = '"'.$content;
        $out .= (isset($t['title'])) ? ' ('.$t['title'].')' : '';
        $out .= '":'.$t['href'];
        return $out;
    } else {
        // Tags without a mapping (li, u, table, ...) are returned unchanged.
        return $all;
    }
}

// -------------------------------------------------------------
function filterAtts($atts, $ok)
{
    $out = array();  // initialised so an empty result is still an array
    foreach ($atts as $a) {
        if (in_array($a['name'], $ok)) {
            if ($a['att'] != '') {
                $out[$a['name']] = $a['att'];
            }
        }
    }
    return $out;
}

// -------------------------------------------------------------
function sci($a)
{
    $out = '';
    foreach ($a as $t) {
        $out .= ($t['name'] == 'class') ? '(='.$t['att'].')' : '';
        $out .= ($t['name'] == 'id') ? '[='.$t['att'].']' : '';
        $out .= ($t['name'] == 'style') ? '{='.$t['att'].'}' : '';
        $out .= ($t['name'] == 'cite') ? ':'.$t['att'] : '';
    }
    return $out;
}

// -------------------------------------------------------------
function splat($attr)  // returns attributes as an array
{
    $arr = array();
    $atnm = '';
    $mode = 0;  // 0 = expecting a name, 1 = expecting "=", 2 = expecting a value

    while (strlen($attr) != 0) {
        $ok = 0;
        switch ($mode) {
            case 0: // name
                if (preg_match('/^([a-z]+)/i', $attr, $match)) {
                    $atnm = $match[1]; $ok = $mode = 1;
                    $attr = preg_replace('/^[a-z]+/i', '', $attr);
                }
                break;

            case 1: // =
                if (preg_match('/^\s*=\s*/', $attr)) {
                    $ok = 1; $mode = 2;
                    $attr = preg_replace('/^\s*=\s*/', '', $attr);
                    break;
                }
                // A name followed by whitespace is a valueless attribute.
                if (preg_match('/^\s+/', $attr)) {
                    $ok = 1; $mode = 0;
                    $arr[] = array('name'=>$atnm,'whole'=>$atnm,'att'=>$atnm);
                    $attr = preg_replace('/^\s+/', '', $attr);
                }
                break;

            case 2: // value
                // Double-quoted value.
                if (preg_match('/^("[^"]*")(\s+|$)/', $attr, $match)) {
                    $arr[] = array('name'=>$atnm,'whole'=>$atnm.'='.$match[1],
                        'att'=>str_replace('"','',$match[1]));
                    $ok = 1; $mode = 0;
                    $attr = preg_replace('/^"[^"]*"(\s+|$)/', '', $attr);
                    break;
                }
                // Single-quoted value.
                if (preg_match("/^('[^']*')(\s+|$)/", $attr, $match)) {
                    $arr[] = array('name'=>$atnm,'whole'=>$atnm.'='.$match[1],
                        'att'=>str_replace("'",'',$match[1]));
                    $ok = 1; $mode = 0;
                    $attr = preg_replace("/^'[^']*'(\s+|$)/", '', $attr);
                    break;
                }
                // Unquoted value.
                if (preg_match("/^(\w+)(\s+|$)/", $attr, $match)) {
                    $arr[] = array('name'=>$atnm,'whole'=>$atnm.'="'.$match[1].'"',
                        'att'=>$match[1]);
                    $ok = 1; $mode = 0;
                    $attr = preg_replace("/^\w+(\s+|$)/", '', $attr);
                }
                break;
        }
        // Nothing matched: skip the next chunk and start over with a name.
        if ($ok == 0) {
            $attr = preg_replace('/^\S*\s*/', '', $attr);
            $mode = 0;
        }
    }
    if ($mode == 1) $arr[] =
        array('name'=>$atnm,'whole'=>$atnm.'="'.$atnm.'"','att'=>$atnm);

    return $arr;
}
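Note that detextile() ends by calling decode_high(), which is not part of the listing and is expected to exist in the Textile helper already. If yours lacks it, a minimal stand-in might look like the sketch below. This is an assumption based on the description earlier in this post (numeric entities decoded to UTF-8 via mb_decode_numericentity()), not the helper's actual code; the function name and conversion map are chosen for illustration, and it requires the mbstring extension.

```php
<?php
// Hypothetical stand-in for decode_high(); requires the mbstring extension.
function decode_high_sketch(string $text): string
{
    // Convert numeric entities for code points 0x80 and above into UTF-8
    // characters. The map is: start, end, offset, mask.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_decode_numericentity($text, $convmap, 'UTF-8');
}

echo decode_high_sketch('caf&#233;'), "\n";
// prints: café
```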

The code is based on an unfinished start from http://www.aquarionics.com. Thanks to the guys over there!

Have fun!


12 Comments

Paul on Sep 12, 2007:

this works great. thank you thank you thank you for sharing!!! i was dreading the regular expressions.

Tim Koschuetzki on Sep 13, 2007:

Np, Paul. :]

zollerwagner on Mar 02, 2008:

This is a great help, Tim. Can you explain how you'd use this to convert an entire site?

Would you run it page by page, copying and pasting each page's HTML into a new PHP page based on this snippet:
<?php echo $textile->detextile($htmlText); ?>

Or have you found a way to automate the process?

Tim Koschuetzki on Mar 02, 2008:

So you have an entire arsenal of HTML pages that you want to textilize? Oh well, that should be very easy to automate.

Depending on where your pages live (in the file system, in the database or wherever), you might use file_get_contents() or some database fetching. Invoke detextile() on the result and off you go. : )

You can use the glob() function or, what I prefer, the Folder class of CakePHP, to read all file names from a given folder (also recursively) and then use file_get_contents().

Or did you mean something else?
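A rough sketch of the glob() approach, under a few assumptions: the pages are .html files in a single folder, the result should be written next to each source file with a .textile extension, and the function name is made up for illustration. The converter is passed in as a callable so the sketch does not depend on any particular helper.

```php
<?php
// Hypothetical batch helper: $convert is any callable that maps HTML to Textile.
function convert_folder_sketch(string $dir, callable $convert): array
{
    $written = [];
    // glob() can return false on error, so fall back to an empty list.
    $files = glob(rtrim($dir, '/') . '/*.html') ?: [];
    foreach ($files as $htmlFile) {
        $html = file_get_contents($htmlFile);
        if ($html === false) {
            continue; // unreadable file, skip it
        }
        // Write the result next to the source, e.g. page.html -> page.textile.
        $target = preg_replace('/\.html$/', '.textile', $htmlFile);
        file_put_contents($target, $convert($html));
        $written[] = $target;
    }
    return $written;
}

// Usage, assuming $textile is an instance of the helper that has detextile():
// convert_folder_sketch('/path/to/pages', array($textile, 'detextile'));
```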

zollerwagner on Mar 02, 2008:

Wow, that was a quick response! Thanks.

Yes, that does help.

(I'll try to post my code with broken tags and a bbcode-like container.)

To test, I'm trying to set up an HTML form into which I can paste HTML to be converted. In the body I have this:

[code]
< form action="" method="post" id="detextile">

Original HTML

Conversion to Textile

[/code]

Then above the head I have this:

[code]
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}
[/code]

The script gets as far as the if then stops. What do I have to do differently to invoke your textile function? Man, I must be missing something really basic!

zollerwagner on Mar 02, 2008:

Maybe I over did it. I'll submit the code again.

[code]<form action="" method="post" id="detextile">
Original HTML

Conversion to Textile

[/code]

The HTML form
[code]
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}
[/code]

zollerwagner on Mar 02, 2008:

Darn. The HTML is still not showing. Feel free to delete that and I'll try again. It looks like you're running WP, so maybe will work.

<form action="" method="post" id="detextile">

the HTML
Original HTML

Conversion to Textile

Invoking your Textile function
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}

zollerwagner on Mar 02, 2008:

Sorry about that. What's the trick for showing HTML?

Tim Koschuetzki on Mar 03, 2008:

Yeah, HTML showing is a little broken. Gotta fix that up. : )

So hrm, do you have a textarea that has the htmlText id or name? I don't see that in your form, so the if cannot succeed.

Riyasha on Sep 16, 2008:

listing inside listing is not working..

# test
## item1
## item2
## item3
# test2
## new item
## new item1
