Parsing HTML with C++ (With extra UTF-8 woes)

So I was tasked with making a C++ project that scraped data from HTML pages.

For those who aren’t in the “know”, there’s no neato HTML DOM library like JSoup in C++. There’s nothing with jQuery-like selectors either. There are some niche projects here and there (including a Google project I was pointed to, which I decided probably wasn’t the best fit but is worth mentioning anyway), but nothing really substantial.

In PHP, you could do a couple of things: either use “Simple HTML DOM Parser”, or use SimpleXML with XPath. Both kind of suck for their own reasons. Simple HTML DOM Parser is written in PHP and burns tons of resources on simple operations, and SimpleXML with XPath is… not fun. Turns out C++ isn’t much better, but we’ll get to that later.

In JavaScript, you have jQuery. In Java, JSoup. Both are amazing. If I’m scraping something, I’d prefer to use these… but it’s just too bad I don’t use Java for much besides Android, and I don’t know how well Node.js actually works with jQuery, or whether it has something similar. I’ve never used Node. Maybe worth looking into some day.

C++, though, I had to hack. It’s weird: usually you’ll find C++ libraries for anything, but in this case the internet didn’t help me out too much, besides this blog post, so I went with it.

The tools I ended up using for my project:

  • CURL
  • MySQL++
  • libxml++
  • libtidy

The goal was to write a simple program that rips data from pages and stores it in a database, replacing a legacy PHP command-line application that supported UTF-8 in its entirety from start to finish. (This part is important because there are many potential pitfalls when working with all these libraries and UTF-8.) This was developed on Ubuntu and ported to Debian (and is compatible with MySQL and MariaDB). Not sure why I’m mentioning that, but hey, you never know what kind of information might be important.

First of all, HTML to XML

libxml++ doesn’t parse HTML. As the name would imply, it parses… you guessed it! XML. The first step of this process is to grab the HTML from the page:

#include <curl/curl.h>
#include <sstream>
#include <string>

// libcurl write callback: appends each received chunk to the
// std::stringstream passed in through CURLOPT_WRITEDATA.
static size_t http_write_callback(void *contents, size_t size, size_t nmemb, void *userp) {
	if(userp) {
		static_cast<std::stringstream*>(userp)->write(static_cast<char*>(contents), size * nmemb);
	}

	return size * nmemb;
}

bool get_url(std::string url, std::string* content) {
	if(!content) return false;

	CURL* ch = curl_easy_init();

	if(!ch) return false;

	std::stringstream ss;

	// Ask the server for UTF-8; it is free to ignore this
	struct curl_slist *headers = curl_slist_append(NULL, "Accept-Charset: utf-8");

	curl_easy_setopt(ch, CURLOPT_HTTPHEADER, headers);
	curl_easy_setopt(ch, CURLOPT_URL, url.c_str());
	curl_easy_setopt(ch, CURLOPT_WRITEFUNCTION, http_write_callback);
	curl_easy_setopt(ch, CURLOPT_WRITEDATA, &ss);

	CURLcode res = curl_easy_perform(ch);

	// Release the header list and the handle on every path, not just on failure
	curl_slist_free_all(headers);
	curl_easy_cleanup(ch);

	if(res != CURLE_OK) return false;

	(*content) = ss.str();
	return true;
}

With UTF-8 options, obviously.
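One caveat: a request header only asks for UTF-8, and the server can still send something else. Here is a minimal sketch of a sanity check you could run on the downloaded body before handing it to tidy. The name is_valid_utf8 is my own, made up for illustration; it is not part of CURL or any other library used in this post.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// overlong encodings, UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
	const std::size_t n = s.size();
	std::size_t i = 0;

	while(i < n) {
		const unsigned char c = static_cast<unsigned char>(s[i]);
		std::size_t extra = 0;   // number of continuation bytes expected
		unsigned long cp = 0;    // decoded code point
		unsigned long min = 0;   // smallest code point allowed for this length

		if(c < 0x80) { i++; continue; }
		else if((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
		else if((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
		else if((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
		else return false; // stray continuation byte or invalid lead byte

		if(extra > n - i - 1) return false; // sequence runs past the end

		for(std::size_t k = 1; k <= extra; k++) {
			const unsigned char cc = static_cast<unsigned char>(s[i + k]);
			if((cc & 0xC0) != 0x80) return false; // not a continuation byte
			cp = (cp << 6) | (cc & 0x3F);
		}

		if(cp < min) return false;                      // overlong encoding
		if(cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
		if(cp > 0x10FFFF) return false;                 // beyond Unicode

		i += extra + 1;
	}

	return true;
}
```

With something like this in place, the caller can bail out early (or fall back to a conversion step) when get_url succeeds but is_valid_utf8 returns false for the content.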

Next is using tidy to convert the HTML into valid XML. For my example, the page I’m targeting started with a DOCTYPE DTD declaration.

This is important because the xhtml1-transitional.dtd it was using (and probably a lot of other HTML DTDs) will cause issues with tidy, especially with validation turned on. I was forced to disable validation for many reasons, but the DTD stuff certainly wasn’t helping.

#include <stdexcept>
#include <string>
#include <vector>
// Header locations vary by libtidy version (older packages ship
// <tidy/tidy.h> and <tidy/buffio.h> instead)
#include <tidy.h>
#include <tidybuffio.h>

std::string html_xml(std::string html) {
	TidyDoc tidyDoc = tidyCreate();
	TidyBuffer tidyOutputBuffer = {0};

	bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
		&& tidyOptSetBool(tidyDoc, TidyQuiet, yes)
		&& tidyOptSetBool(tidyDoc, TidyQuoteNbsp, no)
		&& tidyOptSetBool(tidyDoc, TidyXmlDecl, yes) //XML declaration on top of the content
		&& tidyOptSetBool(tidyDoc, TidyForceOutput, yes)
		&& tidyOptSetValue(tidyDoc, TidyInCharEncoding, "utf8") // Input from CURL is UTF-8
		&& tidyOptSetValue(tidyDoc, TidyOutCharEncoding, "utf8") // Output from here should be UTF-8
		&& tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
		&& tidyOptSetBool(tidyDoc, TidyShowWarnings, no)
		&& tidyOptSetInt(tidyDoc, TidyDoctypeMode, TidyDoctypeOmit); //Exclude DOCTYPE

	int tidyResponseCode = -1;

	if (configSuccess) {
		std::vector<unsigned char> bytes(html.begin(), html.end());

		TidyBuffer buf;
		tidyBufInit(&buf);

		// Append the whole input in one call instead of byte by byte
		if(!bytes.empty()) {
			tidyBufAppend(&buf, &bytes[0], static_cast<unsigned int>(bytes.size()));
		}

		tidyResponseCode = tidyParseBuffer(tidyDoc, &buf);
		tidyBufFree(&buf); // the input buffer is no longer needed after parsing
	}

	if (tidyResponseCode >= 0)
		tidyResponseCode = tidyCleanAndRepair(tidyDoc);

	if (tidyResponseCode >= 0)
		tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);

	// Copy the result out before freeing; bp is NULL when nothing was written
	std::string tidyResult;
	if (tidyResponseCode >= 0 && tidyOutputBuffer.bp) {
		tidyResult.assign(reinterpret_cast<char*>(tidyOutputBuffer.bp), tidyOutputBuffer.size);
	}

	// Clean up before a possible throw so nothing leaks on the error path
	tidyBufFree(&tidyOutputBuffer);
	tidyRelease(tidyDoc);

	if (tidyResponseCode < 0) {
		// "literal" + int is pointer arithmetic, so build the message with std::to_string
		throw std::runtime_error("Tidy encountered an error while parsing an HTML response. Tidy response code: " + std::to_string(tidyResponseCode));
	}

	return tidyResult;
}

This is the modified version of MostThingsWeb’s function, which seems to handle UTF-8 properly. I’m not sure if it’s all necessary (the buffer crap, particularly) but I’m not about to fiddle. It compiles, it works, good times.

	std::string tidy = html_xml(spl);

	xmlpp::DomParser parser;
	parser.set_substitute_entities();
	parser.parse_memory(tidy);

	if(!parser) {
		this->last_error = "Parser problem.";
		return false;
	}

	xmlpp::Document* document = parser.get_document();

	if(!document) {
		this->last_error = "Invalid Document";
		return false;
	}

	xmlpp::Node* root = document->get_root_node();

	if(!root) {
		this->last_error = "Invalid Document Root";
		return false;
	}

XPath is nasty, yo

As luck would have it, the legacy PHP program my program was replacing used SimpleXML and XPath to scrape data from the pages, which made using XPath in this instance pretty breezy. XPath is still way, way more confusing than selectors but since that’s not really an option, it was nice that this work was already done for me. The main issue was porting the logic over, which wasn’t a big deal.

Here are some examples for you:

	xmlpp::NodeSet syn = root->find("//td[@valign=\"top\"]");

	// syn[2] is the third match, so at least three results are needed
	if(syn.size() > 2) {
		std::string tempSynopsis = get_inner_text(syn[2]);

		if(tempSynopsis.find("No synopsis has been added for this series yet") == std::string::npos) {
			// substr(8) throws on strings shorter than 8 characters, so check first
			if(tempSynopsis.size() >= 8) {
				result->synopsis = tempSynopsis.substr(8);
			}
		}
	}

	xmlpp::NodeSet imgAttr = root->find("//img[contains(@src, \"http://cdn.myanimelist.net/images/anime\")]");

	if(imgAttr.size() > 0) {
		xmlpp::Element* imageElement = dynamic_cast<xmlpp::Element*>(imgAttr[0]);

		if(imageElement) {
			// get_attribute returns NULL when the attribute is missing
			xmlpp::Attribute* alt = imageElement->get_attribute("alt");
			xmlpp::Attribute* src = imageElement->get_attribute("src");

			if(alt) result->title = alt->get_value();
			if(src) result->image_url = src->get_value();
		}
	}

	xmlpp::NodeSet sidebarQuery = root->find("//td[@class=\"borderClass\"]");

	if(sidebarQuery.size() > 0) {
		xmlpp::Element* sidebarElement = dynamic_cast<xmlpp::Element*>(sidebarQuery[0]);

		// Stays empty (and the loop is skipped) if the cast fails
		xmlpp::Node::NodeList sideChildren;
		if(sidebarElement) sideChildren = sidebarElement->get_children();

		for(xmlpp::Node::NodeList::iterator child = sideChildren.begin(); child != sideChildren.end(); ++child) {
			if((*child)->get_name().compare("div") == 0) {
				std::string nodeValue = get_inner_text((*child));

				// Each sidebar row looks like "Label: value"; split at the first colon
				size_t ppos = nodeValue.find(":");

				if(ppos != std::string::npos) {
					std::string preSemi = nodeValue.substr(0, ppos);
					std::string postSemi = nodeValue.substr(ppos + 1);

					// Drop the single space after the colon
					if(postSemi.substr(0, 1).compare("\x20") == 0) {
						postSemi = postSemi.substr(1);
					}

					if(preSemi.compare("Japanese") == 0) {
						result->other_japanese = postSemi;
					} else if(preSemi.compare("Type") == 0) {
						result->type = postSemi;
					} else if(preSemi.compare("Episodes") == 0) {
						result->episodes = postSemi;
					} else if(preSemi.compare("Status") == 0) {
						result->status = postSemi;
					} else if(preSemi.compare("Aired") == 0) {
						result->aired = postSemi;
					} else if(preSemi.compare("English") == 0) {
						result->other_english = postSemi;
					} else if(preSemi.compare("Synonyms") == 0) {
						if(result->other_english.empty()) {
							result->other_english = postSemi;
						} else {
							result->other_english += (", " + postSemi);
						}
					} else if(preSemi.compare("Duration") == 0) {
						result->duration = postSemi;
					} else if(preSemi.compare("Genres") == 0) {
						result->genres = postSemi;
					} else if(preSemi.compare("Producers") == 0) {
						result->producers = postSemi;
					}
				}
			}
		}
	}

The page in question my parser is targeting (for practicing with this stuff on your own) is here: http://myanimelist.net/anime.php?id=65

There’s other pages too, but you get the idea.
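The long else-if chain above boils down to "split at the first colon, then file the value under its label". Here is a rough standalone sketch of that idea using a std::map instead of the result struct. split_label and store_sidebar_field are hypothetical names I made up for this example, and unlike the original loop this version keeps every label it sees rather than a fixed list.

```cpp
#include <map>
#include <string>

// Splits "Label: value" at the first colon and drops one leading space
// from the value. Returns false when the text contains no colon.
bool split_label(const std::string& text, std::string* label, std::string* value) {
	const std::string::size_type colon = text.find(':');
	if(colon == std::string::npos) return false;

	*label = text.substr(0, colon);
	*value = text.substr(colon + 1);

	if(!value->empty() && (*value)[0] == ' ') value->erase(0, 1);
	return true;
}

// Hypothetical replacement for the else-if chain: stores the value under
// its label, appending "Synonyms" to "English" the way the original code does.
void store_sidebar_field(std::map<std::string, std::string>& fields, const std::string& text) {
	std::string label, value;
	if(!split_label(text, &label, &value)) return;

	if(label == "Synonyms") {
		std::string& english = fields["English"];
		english += english.empty() ? value : ", " + value;
		return;
	}

	fields[label] = value;
}
```

Inside the sidebar loop you would call store_sidebar_field(fields, get_inner_text(*child)) for each div and read the entries you care about afterwards. Whether that is nicer than the explicit chain is a matter of taste; the chain has the advantage of ignoring labels you never asked for.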

get_inner_text?

Well, I wrote a function to get the text of a node from libxml++ (sort of like $('.stuff').text() in jQuery). It took a while to get right for my purposes, but it’s otherwise fairly straightforward.

#include <libxml++/libxml++.h>
#include <string>

bool is_dead_character(int c) {
	return (c == '\n' || c == '\r' || c == '\t' || c == 0x20);
}

// Returns true when the string is empty or contains only whitespace
bool is_dead_string(const std::string& is) {
	for(std::string::const_iterator it = is.begin(); it != is.end(); ++it) {
		if(!is_dead_character(*it)) {
			return false;
		}
	}

	return true;
}

std::string get_inner_text(xmlpp::Node* node) {
	std::string ret;

	if(!node) return ret;

	xmlpp::Node::NodeList list = node->get_children();

	for(xmlpp::Node::NodeList::iterator iter = list.begin(); iter != list.end(); ++iter) {
		// I needed line breaks, if you don't, change this
		if((*iter)->get_name().compare("br") == 0) {
			ret += "<br>";
			continue;
		}

		// I would remove this line if you really want to capture everything
		if((*iter)->get_name().compare("comment") == 0 || (*iter)->get_name().compare("small") == 0) {
			continue;
		}

		// Recursive
		if((*iter)->get_name().compare("text") != 0) {
			ret += get_inner_text((*iter));
			continue;
		}

		const xmlpp::TextNode* text = dynamic_cast<const xmlpp::TextNode*>(*iter);

		if(!text) continue;

		std::string go = text->get_content();

		// This function just skips completely blank entries (with only spaces, or only \n or whatever)
		// I'd remove this if you were to use it
		if(is_dead_string(go)) {
			continue;
		}

		// Erases \n on the end of the string if it exists
		// (go is never empty here, because is_dead_string() treats an empty string as dead)
		if(go[go.size() - 1] == '\n') {
			go.erase(go.size() - 1);
		}

		// Replaces the remaining newlines with spaces
		std::string::size_type pos = 0;

		while((pos = go.find('\n', pos)) != std::string::npos) {
			go[pos] = '\x20';
			pos++;
		}

		ret += go;
	}

	return ret;
}

Maybe somebody will find it useful, maybe not.
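For what it’s worth, the blank-string check can be squeezed into a single standard algorithm call. This is just a sketch assuming a C++11 compiler. Note that it needs std::all_of (every character is whitespace), not std::any_of, which would return true as soon as a single space shows up.

```cpp
#include <algorithm>
#include <string>

// True for the whitespace characters the post treats as "dead".
bool is_dead_character(char c) {
	return c == '\n' || c == '\r' || c == '\t' || c == ' ';
}

// True when every character is dead. An empty string also counts as dead,
// which matches the loop-based version.
bool is_dead_string(const std::string& s) {
	return std::all_of(s.begin(), s.end(), is_dead_character);
}
```

The empty-string behaviour matters: get_inner_text relies on it to guarantee that a string which survives the check has at least one character.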

MySQL++ and UTF-8
So, yeah. MySQL++ (and mysql in PHP) doesn’t handle UTF-8 crap implicitly unless your server is configured that way. You’ll end up with a bunch of gibberish, even if your column’s collation is utf8_unicode_ci like mine was.

bool sql::connect(std::string host, std::string username, std::string password, std::string db) {
	conn = new mysqlpp::Connection(false);

	conn->set_option(new mysqlpp::FoundRowsOption(true));
	conn->set_option(new mysqlpp::SetCharsetNameOption("utf8")); // In PHP it's something like mysql_query("SET NAMES utf8"); or mysql_set_charset("utf8");

	return conn->connect(db.c_str(), host.c_str(), username.c_str(), password.c_str());
}

After that’s settled, we’re ready to finally slap it all together hopefully.

bool update_mal_entry(sql* s, std::string mal_id) {
	if(mal_id.empty()) return true;

	printf("MyAnimeList ID: %s\n", mal_id.c_str());

	// Lives on the stack so it is destroyed on every return path
	// (it used to be allocated with new and never deleted)
	mal pmal(MAL_PREFIX + mal_id);

	mal_result r;
	if(pmal.exec(&r)) {
		int num_episodes = 0;

		try {
			num_episodes = std::stoi(r.episodes);
		} catch(...) { /* On Failure, I actually want it to be zero. I don't care. */ }

		try {
			mysqlpp::Query query =
				s->connection()->query(
					"UPDATE animedata SET "
					"`status`=%0, `synopsis`=%1q, `genres`=%2q, `altnames`=%3q, "
					"`duration`=%4q, `aired`=%5q, `numep`=%6, `malimg`=%7q, `japanese`=%8q WHERE `MAL`=%9");

			query.parse();

			mysqlpp::SQLQueryParms parms;

			parms << r.statusNum();
			parms << r.synopsis;
			parms << r.genres;
			parms << r.other_english;
			parms << r.duration;
			parms << r.aired;
			parms << num_episodes;
			parms << r.image_url;
			parms << r.other_japanese;
			parms << std::stoi(mal_id);

			// Want to check if your query is broken as fuck? uncomment this.
			//std::cout << "Query [" << query.str(parms) << "]" << std::endl;

			mysqlpp::SimpleResult res = query.execute(parms);

			return (res.rows() > 0);
		} catch (const mysqlpp::BadQuery& er) {
			printf("Query error: %s\n", er.what());
		} catch (const mysqlpp::BadConversion& er) {
			printf("Conversion error: %s\n", er.what());
		} catch (const mysqlpp::Exception& er) {
			printf("Exception: %s\n", er.what());
		} catch (const std::exception& er) {
			// std::stoi(mal_id) throws if the ID is not numeric
			printf("Error: %s\n", er.what());
		}
	} else {
		printf("mal::exec failed!\n");
	}

	return false;
}
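Both the episode count and the ID go through std::stoi, which throws when the text isn’t a number. If you’d rather not scatter try blocks around, a tiny helper does the same job. This is only a sketch: parse_int_or is a name I made up, and "Unknown" in the note afterwards is just my guess at the kind of non-numeric value a page might contain.

```cpp
#include <stdexcept>
#include <string>

// Converts the leading integer in `text`, or returns `fallback` when the
// text does not start with a number or the number does not fit in an int.
int parse_int_or(const std::string& text, int fallback) {
	try {
		return std::stoi(text);
	} catch(const std::invalid_argument&) {
		return fallback; // no digits at the start of the string
	} catch(const std::out_of_range&) {
		return fallback; // value too large or too small for int
	}
}
```

With that, the first try block becomes int num_episodes = parse_int_or(r.episodes, 0); and a value such as "Unknown" quietly turns into zero. Keep in mind that std::stoi accepts trailing junk, so "12 eps" still parses as 12.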

That concludes my lazy as hell article. Hopefully it’s useful to you.

Oh! One more thing!

Let’s say you have a database with entries already improperly UTF-8 encoded (as in, UTF-8 crap written to the DB over a connection without the UTF-8 charset, leaving a bunch of gibberish).

You can convert the column to proper UTF-8 using this:
UPDATE hashtags SET tag=CONVERT(BINARY CONVERT(tag USING latin1) USING utf8);

The ‘tag’ column inside of the ‘hashtags’ table is the target of conversion here.
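If you’re wondering what that query actually undoes: the UTF-8 bytes were treated as latin1 characters and each one was encoded to UTF-8 a second time. Here is a small sketch of the reverse step in C++, purely to illustrate the idea. undo_double_encoding is a made-up name, and it assumes plain ISO-8859-1. MySQL’s latin1 is really cp1252, which maps some bytes in the 0x80 to 0x9F range differently, so treat this as an illustration and not a replacement for the query.

```cpp
#include <cstddef>
#include <string>

// Undoes one round of "UTF-8 bytes read as ISO-8859-1 and re-encoded as UTF-8".
// Returns false if the input holds anything outside U+0000..U+00FF, meaning it
// was not double-encoded this way; `out` is then only partially filled.
bool undo_double_encoding(const std::string& in, std::string* out) {
	out->clear();

	for(std::size_t i = 0; i < in.size(); ) {
		const unsigned char c = static_cast<unsigned char>(in[i]);

		if(c < 0x80) {
			// ASCII is unchanged by the double encoding
			out->push_back(static_cast<char>(c));
			i += 1;
		} else if((c == 0xC2 || c == 0xC3) && i + 1 < in.size()) {
			const unsigned char next = static_cast<unsigned char>(in[i + 1]);
			if((next & 0xC0) != 0x80) return false;

			// U+0080..U+00FF: 110000xx 10yyyyyy collapses back to byte xxyyyyyy
			out->push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
			i += 2;
		} else {
			return false;
		}
	}

	return true;
}
```

For example, the four bytes C3 83 C2 A9 (which display as "Ã©") collapse back to C3 A9, the proper UTF-8 encoding of "é". Text that was stored correctly in the first place usually makes the function return false, which is a decent hint that a row does not need fixing.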

It’ll turn stuff like the picture below

[Image: tag values before conversion]

into something more like

[Image: tag values after conversion]

AND NOW I’M FINALLY FINISHED CLEANING UP THE MESS.

2 comments

  1. Tim says:

    Thanks for the article but I really hate the code style. If you present code to the world please try to use idiomatic C++[11] code.
    - please check the pointer before touching it or better use references when you don’t expect a nullptr
    - please apply RAII for resource ownership, isn’t pmal a memory leak
    - is_dead_string can be replaced by std::any_of( , , is_dead_char)
    - always check the result of a dynamic_cast
    - use return values instead of mutable arguments
    - constant correctness
    - …

    • s0beit says:

      Yeah, there’s some problems. I’m doing things half-correctly as you say, but I’ll try to respond anyway
      - Indeed, I messed that up a couple times.
      - pmal is a memory leak, but it is also a resource only used once per execution. I can delete it after use, and I will add this, but it caused no major issues due to its nature. It really doesn’t need to be allocated at all I suppose; I left it that way from the test phase.
      - Noted
      - Indeed
      - Mutable arguments were preferable in this example to me, frankly I don’t see the issue. It is the easiest method to use to both check the status and return the data I need without wrapping it in some other class or struct with result embedded into it.
      - I’m not sure I understand
      - “…”