Read How to parse HTML...
Here is a short list of HTML parsers for popular languages. XML parsers seem to be easy enough to find.
For UNIX command-line tools (awk, sed, etc.), consider converting HTML to XHTML using Tidy and then to PYX using XMLStarlet:
## extract all hyperlinks
<bookmark.htm tidy -asxhtml 2>/dev/null | xmlstarlet pyx | sed '/^(a/,/^)a/!d;/^Ahref /!d;s///'
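If a scripting language is available, the same link extraction can be done with a real HTML parser instead of a pipeline. The following is a minimal sketch using `html.parser` from the Python standard library. The names `LinkExtractor` and `extract_links` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value None for valueless attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all hyperlink targets found in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Illustrative input only.
sample = '<p><a href="http://example.com/">one</a> <A HREF="/two">two</A></p>'
print(extract_links(sample))
```

Because the parser tokenizes the markup itself, differences in tag case or attribute order do not need to be handled by a regular expression.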
Ideally, you'll want to use the features of your language or application software to do this. Here are some examples:
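As one illustration, assuming the goal is to keep only the text that does not match a pattern, the negation can be done in the host language rather than inside the regular expression. This is a sketch in Python. The function name `lines_without` and the sample data are hypothetical.

```python
import re


def lines_without(pattern, lines):
    """Return the lines in which `pattern` does not match anywhere."""
    regex = re.compile(pattern)
    # The regex itself stays simple; the "not" lives in the language.
    return [line for line in lines if not regex.search(line)]


# Illustrative input only.
sample = ["foo bar", "bar baz", "food", "quux"]
print(lines_without(r"foo", sample))
```

Command-line tools offer the same idea through their own negation features, for example `grep -v foo`.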
If you cannot use such a technique because your application (e.g. a text editor) does not allow that level of programmability, you may be able to get by with an expression such as:
Note however that this may be much slower than the equivalent negated expression.
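One way to write such a self-negating expression, assuming the regex engine supports negative lookahead, is sketched below in Python. The pattern `foo` and the names `NO_FOO` and `lacks_foo` are placeholders, and other engines may need a different construction.

```python
import re

# Assumed pattern: at the start of the line, assert that "foo" does not
# occur anywhere ahead, then consume the rest of the line.
NO_FOO = re.compile(r"^(?!.*foo).*$")


def lacks_foo(line):
    """True if the single line `line` contains no occurrence of 'foo'."""
    return NO_FOO.match(line) is not None


for candidate in ["bar baz", "a foo b", ""]:
    print(repr(candidate), lacks_foo(candidate))
```

The lookahead has to scan ahead for `foo` before the rest of the pattern runs, which is part of why this form can cost more than searching for `foo` and negating the result.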
This is another of those situations where regular expressions alone are not enough. The best way is to match the line against multiple patterns:
/foo/ && $str
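The same approach can be sketched in Python: test the string against each pattern separately and combine the results with a logical AND. The helper name `matches_all` and the second pattern `bar` are assumptions made for illustration.

```python
import re


def matches_all(text, patterns):
    """True if every pattern in `patterns` matches somewhere in `text`."""
    return all(re.search(p, text) for p in patterns)


# Illustrative input only.
sample = ["foo then bar", "bar then foo", "only foo", "neither"]
print([s for s in sample if matches_all(s, [r"foo", r"bar"])])
```

Each pattern stays simple, and the order in which the patterns occur in the text does not matter.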
Again, if you cannot use such a technique, try