Simple Machines Bug and Feature Tracker
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0004981||SMF 2.1||Posts||public||2012-04-25 16:45||2013-10-08 23:52|
|Priority||normal||Severity||minor||Reproducibility||have not tried|
|Summary||0004981: handling MS Smart Quotes|
|Description||As reported by MrPhil: http://www.simplemachines.org/community/index.php?topic=475099|
The support boards for all versions of SMF are clogged with reports of certain characters cutting off the rest of a post, or otherwise apparently causing mischief. The root cause of these problems is that people cut and paste text from Microsoft products (especially Word) that contains MS's "Smart Quotes", which as single bytes exist only in the CP-1252 encoding. My proposal is that all incoming text (from TEXT, TEXTAREA, and possibly other input fields) be scanned for Smart Quotes characters (binary), and that any found be replaced by HTML entities. str_replace() might do the job.
Read the topic. It has a very nicely formatted description and a lot more text.
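A minimal sketch of the proposal above, assuming the incoming text is a single-byte CP-1252 string. The function name `smartQuotesToEntities` and the choice of numeric entities are illustrative assumptions, not SMF code:

```php
<?php
// Hypothetical helper, not part of SMF: replace CP-1252 "smart" bytes with
// numeric HTML entities. Only safe on single-byte input, because the same
// byte values also occur inside multi-byte UTF-8 sequences.
function smartQuotesToEntities(string $text): string
{
    $map = [
        "\x82" => '&#8218;', // single low-9 quotation mark
        "\x84" => '&#8222;', // double low-9 quotation mark
        "\x85" => '&#8230;', // horizontal ellipsis
        "\x91" => '&#8216;', // left single curly quote
        "\x92" => '&#8217;', // right single curly quote
        "\x93" => '&#8220;', // left double curly quote
        "\x94" => '&#8221;', // right double curly quote
        "\x96" => '&#8211;', // en dash
        "\x97" => '&#8212;', // em dash
    ];

    // The replacements are pure ASCII, so one pass cannot re-trigger another.
    return str_replace(array_keys($map), array_values($map), $text);
}
```

Text without any of these bytes passes through unchanged, so the call could in principle be applied to every incoming field.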
Spuds (SMF Friend)
edited on: 2012-05-13 15:02
Should we look for all of the special MS codes or just the most popular ... Also seems like we may need to search both utf-8 and chr() representations, but I'm not sure.
Anyway here is something to start with ...
/**
 * Microsoft uses its own character set, Code Page 1252 (CP1252), which is a
 * superset of ISO 8859-1, defining several characters between 128 and 159
 * that are not normally displayable. This converts the more popular ones
 * that appear from a cut and paste from Windows.
 *
 * @param string $string
 * @return string $string
 */
// The function declaration is missing from this note; the name is assumed.
function sanitizeMSCutPaste($string)
{
    // UTF-8 occurrences of MS special characters
    $findchars_utf8 = array(
        "\xe2\x80\x9a", // single low-9 quotation mark
        "\xe2\x80\x9e", // double low-9 quotation mark
        "\xe2\x80\xa6", // horizontal ellipsis
        "\xe2\x80\x98", // left single curly quote
        "\xe2\x80\x99", // right single curly quote
        "\xe2\x80\x9c", // left double curly quote
        "\xe2\x80\x9d", // right double curly quote
        "\xe2\x80\x93", // en dash
        "\xe2\x80\x94", // em dash
    );

    // windows 1252 / iso equivalents
    // (the entries are missing from this note; assumed here to be the CP1252
    // byte values of the same nine characters, in the same order)
    $findchars_iso = array(
        chr(130), chr(132), chr(133),
        chr(145), chr(146), chr(147),
        chr(148), chr(150), chr(151),
    );

    // safe replacements
    $replacechars = array(
        ',', // ‚
        ',,', // „
        '...', // …
        "'", // ‘
        "'", // ’
        '"', // “
        '"', // ”
        '-', // –
        '--', // —
    );

    $string = str_replace($findchars_utf8, $replacechars, $string);
    // Byte-level pass: only safe when $string is not UTF-8 encoded.
    $string = str_replace($findchars_iso, $replacechars, $string);

    return $string;
}
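The open question of whether to search both representations could be settled by checking the encoding first and running only the matching pass. A rough sketch, assuming the input is either valid UTF-8 or a single-byte CP-1252/ISO string. `replaceSmartChars` is a hypothetical name, and the `preg_match('//u', ...)` validity check is one common idiom, not necessarily what SMF does:

```php
<?php
// Hypothetical helper: pick the search list that matches the input encoding.
function replaceSmartChars(string $string): string
{
    // Same nine characters, in the same order, in all three lists.
    $replace = [',', ',,', '...', "'", "'", '"', '"', '-', '--'];
    $utf8 = [
        "\xe2\x80\x9a", "\xe2\x80\x9e", "\xe2\x80\xa6",
        "\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c",
        "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94",
    ];
    $cp1252 = ["\x82", "\x84", "\x85", "\x91", "\x92", "\x93", "\x94", "\x96", "\x97"];

    // An empty pattern with the /u modifier only matches valid UTF-8 subjects;
    // on invalid UTF-8, preg_match() returns false instead of 1.
    $isUtf8 = preg_match('//u', $string) === 1;

    return str_replace($isUtf8 ? $utf8 : $cp1252, $replace, $string);
}
```

With this check, a UTF-8 euro sign (bytes E2 82 AC) is left intact, whereas an unconditional byte-level pass would replace its 0x82 byte with a comma.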
AngelinaBelle (SMF Friend)
Is function ConvertUtf8 already attempting conversion of the windows-1252 code page as of SMF 2.0?
Has it got the correct mapping?
Spuds (SMF Friend)
To my knowledge, no. Most of the DB-to-UTF-8 conversion depends on MySQL doing the conversion work when you change character sets.
SMF has a couple of translation tables built in to support character sets that MySQL does not natively support (windows 1253 and 1255), which we use in our language files.
It seems like an easy test to see what would happen and whether anything is converted. If it's not converted and is just translated or dropped, we could, in theory, add a function like the one above to the conversion, e.g. UPDATE `table` SET `col` = REPLACE(`col`, CHAR(133), '...'); or maybe UPDATE `table` SET `col` = REPLACE(`col`, CHAR(133), 0xE280A6); etc., if moving to UTF-8 from a code page that you know should not have CHR(133) in it, of course!
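The REPLACE() idea could be extended to all nine characters by generating one statement per character. This is a sketch only: `buildSmartCharUpdates` is a hypothetical helper, `table` and `col` remain placeholders as in the note, and it assumes the column holds CP-1252 bytes and that the hex literals are the UTF-8 encodings of the same characters:

```php
<?php
// Hypothetical helper: build one UPDATE per CP-1252 "smart" character.
// Identifiers are interpolated as-is, so pass trusted names only.
function buildSmartCharUpdates(string $table, string $column): array
{
    // CP-1252 byte value => hex of the UTF-8 encoding of the same character
    $map = [
        130 => 'E2809A', // single low-9 quotation mark
        132 => 'E2809E', // double low-9 quotation mark
        133 => 'E280A6', // horizontal ellipsis
        145 => 'E28098', // left single curly quote
        146 => 'E28099', // right single curly quote
        147 => 'E2809C', // left double curly quote
        148 => 'E2809D', // right double curly quote
        150 => 'E28093', // en dash
        151 => 'E28094', // em dash
    ];

    $statements = [];
    foreach ($map as $byte => $utf8Hex) {
        $statements[] = sprintf(
            'UPDATE `%1$s` SET `%2$s` = REPLACE(`%2$s`, CHAR(%3$d), 0x%4$s);',
            $table,
            $column,
            $byte,
            $utf8Hex
        );
    }

    return $statements;
}
```

The statements are returned as strings rather than executed, so they can be reviewed before being run against a copy of the data.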
From the original report (and from a couple of quick tests I did) it seems to me that these chars cannot enter into the database.
In 2.1, Spuds added code to convert such chars into something else (their acceptable equivalents).
So, putting the two together, none of these chars should be able to enter an SMF table; if so, there is no reason to take them into account during the conversion.
Marking this as solved; future issues should go on GitHub anyway.
parse_bbc calls the sanitizer function, so all the major places are already dealt with; we've never actually seen it reported anywhere else (such as other input fields).
2.1 also defaults to UTF-8 for fresh installs so that's less of an issue anyway.