How To Transform HTML To Textile Markup - The CakePHP TextileHelper Revisited
Posted by Tim Koschützki, on Aug 23, 2007 - in PHP & CakePHP » Views & Helpers
Hi folks. For a current project of mine I had to find a way to convert HTML into Textile markup. Why? Because we are using TinyMCE to turn our textareas into WYSIWYG editors, which generate HTML. However, we want all output controlled via Textile so that only the Textile tags are allowed. Yes, we could do it with strip_tags(), but Textile is much more elegant. Plus, it was a requirement by the client. Come on and find out how to detextile HTML.
How Does It Work?
Well, the code is not entirely trivial, but it looks like what you would expect: a bunch of regex processing. Here is the basic detextile() method, which takes some (HTML) text:
    function detextile($text) {
        $oktags = array(/* [...] */
            'table','tr','td','u','del','sup','sub','blockquote');

        foreach($oktags as $tag){
            // [...]
        }

        $text = $this->detextile_process_glyphs($text);
        $text = $this->detextile_process_lists($text);
        // [...]
        return $this->decode_high($text);
    }
Okay, so we run processTag() on every HTML tag we want to cover, process glyphs (we will get to that in a minute) and lists, eliminate all tabs and paragraphs, and return the decoded text. The decoding uses UTF-8 as the standard charset and relies on the mb_decode_numericentity() function. So what does processTag() do?
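If you want to see that flow in isolation, here is a minimal sketch of it. It is not the helper's code: sketchDetextile() and sketchProcessTag() are made-up names, the tag list is cut down to four phrase tags, the regex is my own guess at a workable pattern, and the entity map passed to mb_decode_numericentity() is an assumption. It needs PHP 7 or later with the mbstring extension.

```php
<?php
// Hypothetical stand-in for processTag(): handles a few phrase tags only.
function sketchProcessTag(array $m): string
{
    // $m[0] = whole match, $m[1] = tag name, $m[2] = attributes, $m[3] = inner content
    $marks = ['strong' => '*', 'em' => '_', 'b' => '**', 'i' => '__'];
    $tag = strtolower($m[1]);
    return isset($marks[$tag]) ? $marks[$tag] . $m[3] . $marks[$tag] : $m[0];
}

// Hypothetical stand-in for detextile(): one regex pass per allowed tag,
// then numeric entities are decoded into UTF-8 characters.
function sketchDetextile(string $html): string
{
    $okTags = ['strong', 'em', 'b', 'i'];
    foreach ($okTags as $tag) {
        // U = ungreedy, s = dot matches newlines, i = case-insensitive
        $pattern = '/<(' . $tag . ')\b([^>]*)>(.*)<\/\1>/Usi';
        $html = preg_replace_callback($pattern, 'sketchProcessTag', $html);
    }
    // Map format: [first code point, last code point, offset, mask].
    return mb_decode_numericentity($html, [0x80, 0xffff, 0, 0xffff], 'UTF-8');
}

echo sketchDetextile('<strong>bold</strong> and <em>italic</em>'), "\n";
// prints: *bold* and _italic_
```

Running one pass per tag keeps each regex simple, at the price of scanning the text several times. Tags that are not in the list simply pass through untouched.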
    function processTag($matches) {
        // [...]
        $a = $this->splat($atts);

        $phr = array(
            'em'=>'_',
            'i'=>'__',
            'b'=>'**',
            'strong'=>'*',
            'cite'=>'??',
            'del'=>'-',
            'ins'=>'+',
            'sup'=>'^',
            'sub'=>'~',
            'span'=>'%');

        if (/* [...] */) {
            return $phr[$tag].$this->sci($a).$content.$phr[$tag];
        } elseif($tag=='blockquote') {
            return 'bq.'.$this->sci($a).' '.$content;
        } elseif (/* [...] */) {
            return $tag.$this->sci($a).'. '.$content;
        } elseif ($tag=='a') {
            // [...]
            $out = '"'.$content;
            $out.= '":'.$t['href'];
            return $out;
        } else {
            return $all;
        }
    }

    function sci($a)
    {
        $out = '';
        foreach($a as $t){
            $out.= ($t['name']=='class') ? '(='.$t['att'].')' : '';
            $out.= ($t['name']=='id') ? '[='.$t['att'].']' : '';
            $out.= ($t['name']=='style') ? '{='.$t['att'].'}' : '';
            $out.= ($t['name']=='cite') ? ':'.$t['att'] : '';
        }
        return $out;
    }
Here is where much of the converting takes place. We have a map from HTML tags to their Textile markers and apply it here. We preserve any classes, ids and other attributes using the splat() and sci() methods and return the text. The splat() method is quite sophisticated and long to explain, but it should become clear when you look at it below.
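To get a feeling for what sci() is fed, here is a much simplified stand-in for splat(). It is only a sketch: sketchSplat() is a made-up name, it uses a single preg_match_all() call instead of the helper's step-by-step parsing, and it only understands quoted name="value" pairs. The one thing it shares with the helper is the shape of the result, a list of arrays with 'name' and 'att' keys.

```php
<?php
// Hypothetical, simplified replacement for splat():
// handles only name="value" and name='value' pairs.
function sketchSplat(string $attr): array
{
    $out = [];
    $pattern = '/([a-z][a-z0-9-]*)\s*=\s*("([^"]*)"|\'([^\']*)\')/i';
    preg_match_all($pattern, $attr, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Group 3 holds a double-quoted value, group 4 a single-quoted one.
        $value = ($m[3] !== '') ? $m[3] : ($m[4] ?? '');
        $out[] = ['name' => strtolower($m[1]), 'att' => $value];
    }
    return $out;
}

print_r(sketchSplat('class="note" id=\'x1\''));
```

Unquoted values and bare attributes without a value are ignored by this sketch.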
Now on to the glyphs and the list methods:
    function detextile_process_glyphs($text) {
        $glyphs = array(
            '’'=>'\'', # single closing
            '‘'=>'\'', # single opening
            '”'=>'"', # double closing
            '“'=>'"', # double opening
            '—'=>'--', # em dash
            '–'=>' - ', # en dash
            '×' =>'x', # dimension sign
            '™'=>'(TM)', # trademark
            '®' =>'(R)', # registered
            '©' =>'(C)', # copyright
            '…'=>'...' # ellipsis
        );

        foreach($glyphs as $f=>$r){
            // [...]
        }
        return $text;
    }
Easy. It simply converts some HTML entities for glyphs into their Textile equivalents.
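A replacement like this can be sketched in a few lines. Two assumptions here: the keys are written as numeric HTML entities (the listing shows the characters themselves), and the lookup is done with strtr() rather than a loop, which may well differ from the helper. sketchGlyphs() is a made-up name and the table is shortened.

```php
<?php
// Hypothetical glyph replacement: numeric entities to plain ASCII.
function sketchGlyphs(string $text): string
{
    $glyphs = [
        '&#8217;' => "'",   // single closing quote
        '&#8220;' => '"',   // double opening quote
        '&#8221;' => '"',   // double closing quote
        '&#8212;' => '--',  // em dash
        '&#8230;' => '...', // ellipsis
    ];
    // strtr() with an array replaces every key in one pass
    // and never re-scans text it has already replaced.
    return strtr($text, $glyphs);
}

echo sketchGlyphs('It&#8217;s &#8220;done&#8221;&#8230;'), "\n";
// prints: It's "done"...
```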
The list method:
    function detextile_process_lists($text) {
        $list = false;
        // [...]
        foreach($text as $line){
            if (/* [...] */) {
                $line = "";
                $list = "o";
            } else if (/* [...] */) {
                $line = "";
                $list = false;
            } else if (/* [...] */) {
                $line = "";
                $list = "u";
            } else if (/* [...] */) {
                $line = "";
                $list = false;
            } else if ($list == 'o'){
                // [...]
            } else if ($list == 'u'){
                // [...]
            }
            $glyph_out[] = $line;
        }
        // [...]
    }
This method is a bit more tricky. It wipes out the list tags themselves (ul, ol) and converts all li tags into their Textile equivalents: either "# " (for ordered lists) or "* " (for unordered lists).
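The idea can be sketched as a small state machine that remembers which kind of list it is in. This is not the helper's code: sketchLists() is a made-up name, the regexes are my own, and turning each closing li tag into a newline is an assumption on my part.

```php
<?php
// Hypothetical list conversion: remembers the current list type
// and turns each <li> into the matching Textile marker.
function sketchLists(string $html): string
{
    // Split on tags but keep the tags as separate chunks.
    $chunks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $list = false; // false, 'o' (ordered) or 'u' (unordered)
    $out = [];
    foreach ($chunks as $chunk) {
        if (preg_match('/^<ol\b/i', $chunk)) {
            $list = 'o';
            $chunk = '';
        } elseif (preg_match('/^<ul\b/i', $chunk)) {
            $list = 'u';
            $chunk = '';
        } elseif (preg_match('/^<\/(ol|ul)\b/i', $chunk)) {
            $list = false;
            $chunk = '';
        } elseif (preg_match('/^<li\b/i', $chunk)) {
            $chunk = ($list === 'o') ? '# ' : '* ';
        } elseif (preg_match('/^<\/li\b/i', $chunk)) {
            $chunk = "\n"; // assumption: one list item per output line
        }
        $out[] = $chunk;
    }
    return implode('', $out);
}

echo sketchLists('<ol><li>first</li><li>second</li></ol>');
```

Because only one list type is remembered at a time, nested lists are not handled by this sketch.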
How Do You Use The Code?
Using the code is darn easy. You just invoke the detextile() method on your HTML code.
An Example
Here is some example html code we want to convert:
Detextile output:
    *This is some bold text*p. _This is italic text_p. p. <u>Underline text man</u>p. * ul list item1 * ul list item2* ul list item3# ol list item1# ol list item2# ol list item3
Cool! And it was easy as well!
Get The Code
Here are all methods for your CakePHP Textile helper. Of course, you can plug them into any Textile helper for other frameworks as well:
    // -------------------------------------------------------------
    // The following functions are used to detextile html, a process
    // still in development.
    // By Tim Koschützki
    //
    // Based on code from http://www.aquarionics.com
    // -------------------------------------------------------------
    function detextile($text) {
        $oktags = array(/* [...] */
            'table','tr','td','u','del','sup','sub','blockquote');

        foreach($oktags as $tag){
            // [...]
        }

        $text = $this->detextile_process_glyphs($text);
        $text = $this->detextile_process_lists($text);
        // [...]
        return $this->decode_high($text);
    }

    function detextile_process_glyphs($text) {
        $glyphs = array(
            '’'=>'\'', # single closing
            '‘'=>'\'', # single opening
            '”'=>'"', # double closing
            '“'=>'"', # double opening
            '—'=>'--', # em dash
            '–'=>' - ', # en dash
            '×' =>'x', # dimension sign
            '™'=>'(TM)', # trademark
            '®' =>'(R)', # registered
            '©' =>'(C)', # copyright
            '…'=>'...' # ellipsis
        );

        foreach($glyphs as $f=>$r){
            // [...]
        }
        return $text;
    }

    function detextile_process_lists($text) {
        $list = false;
        // [...]
        foreach($text as $line){
            if (/* [...] */) {
                $line = "";
                $list = "o";
            } else if (/* [...] */) {
                $line = "";
                $list = false;
            } else if (/* [...] */) {
                $line = "";
                $list = "u";
            } else if (/* [...] */) {
                $line = "";
                $list = false;
            } else if ($list == 'o'){
                // [...]
            } else if ($list == 'u'){
                // [...]
            }
            $glyph_out[] = $line;
        }
        // [...]
    }

    function processTag($matches) {
        // [...]
        $a = $this->splat($atts);

        $phr = array(
            'em'=>'_',
            'i'=>'__',
            'b'=>'**',
            'strong'=>'*',
            'cite'=>'??',
            'del'=>'-',
            'ins'=>'+',
            'sup'=>'^',
            'sub'=>'~',
            'span'=>'%');

        if (/* [...] */) {
            return $phr[$tag].$this->sci($a).$content.$phr[$tag];
        } elseif($tag=='blockquote') {
            return 'bq.'.$this->sci($a).' '.$content;
        } elseif (/* [...] */) {
            return $tag.$this->sci($a).'. '.$content;
        } elseif ($tag=='a') {
            // [...]
            $out = '"'.$content;
            $out.= '":'.$t['href'];
            return $out;
        } else {
            return $all;
        }
    }

    // -------------------------------------------------------------
    function filterAtts($atts,$ok)
    {
        foreach($atts as $a) {
            if (/* [...] */) {
                if($a['att']!='') {
                    $out[$a['name']] = $a['att'];
                }
            }
        }
        # dump($out);
        return $out;
    }

    // -------------------------------------------------------------
    function sci($a)
    {
        $out = '';
        foreach($a as $t){
            $out.= ($t['name']=='class') ? '(='.$t['att'].')' : '';
            $out.= ($t['name']=='id') ? '[='.$t['att'].']' : '';
            $out.= ($t['name']=='style') ? '{='.$t['att'].'}' : '';
            $out.= ($t['name']=='cite') ? ':'.$t['att'] : '';
        }
        return $out;
    }

    // -------------------------------------------------------------
    function splat($attr) // returns attributes as an array
    {
        $atnm = '';
        $mode = 0;

        while (/* [...] */) {
            $ok = 0;
            switch ($mode) {
                case 0: // name
                    if (/* [...] */) {
                        $atnm = $match[1]; $ok = $mode = 1;
                    }
                    break;

                case 1: // =
                    if (/* [...] */) {
                        $ok = 1; $mode = 2;
                        break;
                    }
                    if (/* [...] */) {
                        $ok = 1; $mode = 0;
                    }
                    break;

                case 2: // value
                    if (/* [...] */) {
                        $arr[] = array(/* [...] */
                            'att'=>str_replace('"','',$match[1]));
                        $ok = 1; $mode = 0;
                        break;
                    }
                    if (/* [...] */) {
                        $arr[] = array(/* [...] */
                            'att'=>str_replace("'",'',$match[1]));
                        $ok = 1; $mode = 0;
                        break;
                    }
                    if (/* [...] */) {
                        $arr[]=
                            array(/* [...] */
                            'att'=>$match[1]);
                        $ok = 1; $mode = 0;
                    }
                    break;
            }
            if ($ok == 0){
                $mode = 0;
            }
        }
        if ($mode == 1) $arr[] =
            /* [...] */;

        return $arr;
    }
The code is based on an unfinished start from http://www.aquarionics.com. Thanks to the guys over there!
Have fun!
12 Comments
this works great. thank you thank you thank you for sharing!!! i was dreading the regular expressions.
Np, Paul. :]
This is a great help, Tim. Can you explain how you'd use this to convert an entire site?
Would you run it page by page, copying and pasting each page's HTML into a new PHP page based on this snippet:
detextile($htmlText);
?>
Or have you found a way to automate the process?
So you have an entire arsenal of html pages that you want to textilize? Oh well that should be very easy to automate.
Depending on how your pages are represented (in the file system or in the database or wherever) you might use file_get_contents() or some database fetching. Invoke detextile() on the result and off you go. : )
You can use the glob() function or, what I prefer, the Folder class of CakePHP, to read all file names from a given folder (also recursively) and then use file_get_contents().
Or did you mean something else?
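A rough sketch of that kind of automation for pages stored as files, using glob() and file_get_contents(). Everything named here is hypothetical: sketchConvertFolder() is a made-up function, the *.html pattern is an assumption about how the pages are stored, and the converter is passed in as a callable, for example array($textile, 'detextile') if $textile holds the helper instance.

```php
<?php
// Hypothetical batch conversion: runs a converter over every .html file
// in one folder and returns the results keyed by file name.
function sketchConvertFolder(string $dir, callable $convert): array
{
    $results = [];
    // glob() may return false on error, so fall back to an empty list.
    foreach (glob(rtrim($dir, '/') . '/*.html') ?: [] as $file) {
        $html = file_get_contents($file);
        if ($html === false) {
            continue; // unreadable file, skip it
        }
        $results[basename($file)] = $convert($html);
    }
    return $results;
}
```

This only looks at a single folder; reading sub-folders as well would need a recursive walk on top of it.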
Wow, that was a quick response! Thanks.
Yes, that does help.
(I'll try to post my code with broken tags and a bbcode-like container.)
To test, I'm trying to set up an HTML form into which I can paste HTML to be converted. In the body I have this:
[code]
< form action="" method="post" id="detextile">
Original HTML
Conversion to Textile
[/code]
Then above the head I have this:
[code]
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}
[/code]
The script gets as far as the if then stops. What do I have to do differently to invoke your textile function? Man, I must be missing something really basic!
Maybe I overdid it. I'll submit the code again.
[code]<form action="" method="post" id="detextile">
Original HTML
Conversion to Textile
[/code]
The HTML form
[code]
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}
[/code]
Darn. The HTML is still not showing. Feel free to delete that and I'll try again. It looks like you're running WP, so maybe this will work.
<form action="" method="post" id="detextile">
the HTML
Original HTML
Conversion to Textile
Invoking your Textile function
if (isset($_POST['htmlText']))
{
echo $textile->detextile($_POST['htmlText']);
}
Sorry about that. What's the trick for showing HTML?
Yeah, HTML showing is a little broken. Gotta fix that up. : )
So hrm, do you have a textarea that has the htmlText id or name? I don't see that in your form, so the if cannot succeed.
listing inside listing is not working..
# test
## item1
## item2
## item3
# test2
## new item
## new item1

