This is the chtml-matcher Reference Manual, version 1.0, generated automatically by Declt version 3.0 "Montgomery Scott" on Tue Dec 22 12:00:21 2020 GMT+0.
• Introduction: What chtml-matcher is all about
• Systems: The systems documentation
• Files: The files documentation
• Packages: The packages documentation
• Definitions: The symbols documentation
• Indexes: Concepts, functions, variables and data types
chtml-matcher performs pattern-based unification over HTML via a set of compiled nested closures. It uses the closure-html library to parse HTML into LHTML, a Lisp list representation of HTML. You pass a template list to (match-template template lhtml), which returns a bindings object containing an alist of all the extracted information.
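A minimal end-to-end sketch of that flow, assuming both systems load through Quicklisp and that the package is named chtml-matcher, as the package listing in this manual suggests. The page string, the variable names and the template are made up for illustration; the template follows the tag/attributes/body shape and the < prefix described later in this introduction. chtml:parse and chtml:make-lhtml-builder are closure-html calls.

```lisp
;; Assumption: Quicklisp is installed and can find both systems.
(ql:quickload '(:closure-html :chtml-matcher))

;; Hypothetical page fragment used only for this sketch.
(defparameter *page*
  "<html><body><div><a href=\"/t/1\">First thread</a></div></body></html>")

;; Parse the HTML string into LHTML with closure-html.
(defparameter *lhtml*
  (chtml:parse *page* (chtml:make-lhtml-builder)))

;; Hypothetical template: search depth-first for an <a>, bind its href
;; attribute to ?uri and its body to ?title.
(defparameter *link-template*
  '(<a ((href ?uri)) ?title))

;; Call signature as given above; the result is a bindings object.
(defparameter *result*
  (chtml-matcher:match-template *link-template* *lhtml*))
```

Inspect *result* at the REPL to see what was bound before building larger templates.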
The semantics are reasonably intuitive, but it may take some experimentation to get a feel for how to solve the most common problems. The API is small, and package.lisp points to where to look. The whole library is under 1,000 lines of code, so it is easy to read through.
Clone it from GitHub. The old repository on common-lisp.net is deprecated.
chtml-matcher depends on my home-brew cl-stdutils, closure-html, cl-ppcre, and f-underscore, although all but closure-html could be removed if necessary.
The DSL provides a light-weight way to extract fields from nested HTML/XML structure represented in LHTML (as produced by closure-html). A template is a declarative representation of substructure with embedded variables that are bound when the substructure matches.
Substructure is matched loosely: if a given template body element does not match the current child, the next child is considered, until either every template body element has matched an LHTML element or the end of the element has been reached without a match.
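To make the loose-matching rule concrete, here is a small standalone sketch in plain Common Lisp. It is not the library's code: loose-match-p is a hypothetical helper that treats template body elements and children as plain values compared with equal, whereas the real matcher runs compiled closures over LHTML nodes.

```lisp
(defun loose-match-p (patterns children &key (test #'equal))
  "True when every element of PATTERNS matches some element of CHILDREN,
in order, with non-matching children skipped."
  (cond ((null patterns) t)               ; every pattern found a child
        ((null children) nil)             ; ran out of children first
        ((funcall test (first patterns) (first children))
         (loose-match-p (rest patterns) (rest children) :test test))
        (t                                ; skip this child, keep the pattern
         (loose-match-p patterns (rest children) :test test))))
```

With this helper, (loose-match-p '(a c) '(a b c)) succeeds because b is skipped, while (loose-match-p '(c a) '(a b c)) fails because order matters.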
Prepending < to a tag enables a depth-first search for that tag, so you can avoid specifying the parent path (similar to // in XPath).
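The effect of the < prefix can be pictured as a plain depth-first search over LHTML. The sketch below is only an illustration, under the assumption that a node has the form (tag attributes child...) with strings as text children; find-tag and the sample tree are hypothetical and are not part of chtml-matcher.

```lisp
;; Hand-written LHTML-shaped sample tree.
(defparameter *sample-lhtml*
  '(:html nil
    (:body nil
     (:div ((:id "main"))
      (:table nil
       (:tbody nil
        (:tr nil (:td nil "cell"))))))))

(defun find-tag (tag node)
  "Return the first node named TAG found depth-first at or below NODE, else NIL."
  (cond ((not (consp node)) nil)          ; text children are strings
        ((eq (first node) tag) node)      ; this node is the one we want
        (t (some (lambda (child) (find-tag tag child))
                 (cddr node)))))          ; skip the tag and attribute list
```

Calling (find-tag :tbody *sample-lhtml*) returns the :tbody subtree without the caller having to spell out html, body, div and table.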
Any matching template that consists of a variable reference results in a binding set being created and returned if all elements of the template node successfully match.
Additional reserved operators give you more flexibility in managing what is matched and how bindings from subtrees are combined:
all: match the same template multiple times over the children of a given node and store the results as a list attached to a fresh bindings list.
merge: create a single binding out of each of the sub-bindings. A node body has an implicit merge.
nth: find the nth instance that matches the full body of this operator.
regex: matches if the regex returns register values for a string (as a list).
fn: run the referenced function symbol on the current parse state and return bindings, t or nil as appropriate.
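As a small illustration of how the < prefix and the all operator combine, the template below is modeled on the vBulletin example later in this section. The tag names and the ?items, ?uri and ?title variables are hypothetical. The exact argument syntax of nth, regex and fn is not shown in this manual, so they are left out rather than guessed.

```lisp
;; Search depth-first for a <ul>, then collect one set of bindings per
;; matching <li> and store the list under ?items.
(defparameter *link-list-template*
  '(<ul nil
    (all ?items
     (li nil
      (a ((href ?uri)) ?title)))))
```

A template is ordinary quoted list data, so it can be built, inspected and rewritten with the usual list functions before it is handed to match-template.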
I've recently been mining some posts from vBulletin sites. I go to the last day's posts, get a list of all the new posts, then go to the thread and grab the post body. The following two templates do 90% of the work. Of course, I have to write code to convert the data I extract to web page fetches, etc.
(defparameter *vbulletin-search-template*
  '(<tbody nil
    (all ?records
     (tr nil
      (td nil)
      (td ((class "alt1"))
       (div nil
        (a ((href ?thread-uri))
         ?thread-name)))
      (td ((class "alt2") (title ?activity))
       (div nil ?post-date
        (span nil ?post-time)
        (a ((href ?user-uri))
         ?username)
        (a ((href ?last-post-uri)))))))))
=>
'(:records ((:thread-name . "Thread name")
            (:thread-uri . "Thread URI")
            (:post-date . "Date String")
            (:post-time . "Time String")
            (:username . "Username")
            (:user-uri . "URI String")
            (:last-post-uri . "URI String"))
  ...)
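Once the result is in alist form, ordinary list functions are enough to walk it. The sketch below works on a hand-written sample shaped like the output above, not on a live match result; *sample-result*, record-field and thread-summaries are hypothetical names, and it assumes the records follow the :records key as shown.

```lisp
;; Hand-written stand-in for a match result; values are placeholders.
(defparameter *sample-result*
  '(:records
    ((:thread-name . "Thread name")
     (:thread-uri . "Thread URI")
     (:username . "Username"))
    ((:thread-name . "Another thread")
     (:thread-uri . "Another URI")
     (:username . "Another user"))))

(defun record-field (record key)
  "Look up KEY in one record alist; NIL when the key is absent."
  (cdr (assoc key record)))

(defun thread-summaries (result)
  "Return a (thread-name username) list for every record after the :records key."
  (mapcar (lambda (record)
            (list (record-field record :thread-name)
                  (record-field record :username)))
          (rest result)))
```

Calling (thread-summaries *sample-result*) gives one two-element list per record, which is a convenient shape for driving the follow-up page fetches.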
This looks for a table body in the search results page, then gets bindings for all matching elements and puts them within another bindings object bound to :records as specified by 'all'. The pattern pulls out all the user, thread, post and date information for all results. You can match elements on strings, regular expressions and arbitrary function calls as well.
I use subst to customize the following pattern to find a particular post in a page. It replaces 'post_message_?' with a unique id for a post then returns its thread number and the entire post body.
(defparameter *vbulletin-post-template*
  `(<tbody nil
    (tr nil (<a ((name ?post-num))))
    (tr nil)
    (tr nil (?post-body <div ((id "post_message_?"))))))
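One detail worth spelling out: subst compares with eql by default, which will not match two distinct strings, so the substitution needs :test #'equal. The sketch below shows one way to do it; post-template-for and the post id 12345 are hypothetical, and the template is repeated so the block stands alone.

```lisp
(defparameter *vbulletin-post-template*
  '(<tbody nil
    (tr nil (<a ((name ?post-num))))
    (tr nil)
    (tr nil (?post-body <div ((id "post_message_?"))))))

(defun post-template-for (post-id)
  "Return a version of the post template whose placeholder id names POST-ID.
The original template is left untouched because SUBST is non-destructive."
  (subst (format nil "post_message_~A" post-id)  ; replacement string
         "post_message_?"                        ; placeholder to look for
         *vbulletin-post-template*
         :test #'equal))                         ; compare strings by content
```

For example, (post-template-for 12345) yields the same template with the id attribute set to "post_message_12345".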
I use Firebug in Firefox to inspect the HTML tree, identify the best unique enclosing context, and then provide just enough structure to uniquely capture the data I want. This approach is robust to many small HTML changes and should be reasonably fast.
The main system appears first, followed by any subsystem dependency.
• The chtml-matcher system
Ian Eslick
Ian Eslick
MIT style license
A unifying template matcher based on closure-html for web scraping and extraction
1.0
chtml-matcher.asd (file)
Files are sorted by type and then listed depth-first from the systems components trees.
• Lisp files
• The chtml-matcher.asd file
• The chtml-matcher/package.lisp file
• The chtml-matcher/bindings.lisp file
• The chtml-matcher/matcher.lisp file
chtml-matcher.asd
chtml-matcher (system)
chtml-matcher (system)
package.lisp
chtml-matcher (system)
bindings.lisp
chtml-matcher (system)
matcher.lisp
Packages are listed by definition order.
• The chtml-matcher-system package
• The chtml-matcher package
chtml-matcher.asd
package.lisp (file)
Definitions are sorted by export status, category, package, and then by lexicographic order.
• Exported definitions
• Internal definitions
• Exported macros
• Exported functions
• Exported generic functions
• Exported classes
bindings.lisp (file)
Clear all bindings
bindings.lisp (file)
Convenience function for generating state from an lhtml tree
matcher.lisp (file)
bindings.lisp (file)
Return an alist of bindings
bindings.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Make bindings, optionally with a seed variable and value
bindings.lisp (file)
Top level matcher
matcher.lisp (file)
bindings.lisp (file)
bindings.lisp (file)
Set all bindings in bindings list and return the dict. First argument dominates.
bindings.lisp (file)
bindings.lisp (file)
bindings.lisp (file)
bindings.lisp (file)
standard-object (class)
:bindings
bindings (generic function)
(setf bindings) (generic function)
• Internal special variables
• Internal macros
• Internal functions
• Internal generic functions
• Internal structures
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Test the invariant properties of the state
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Make it easy to perform a non-destructive parse operation over a subtree based on the current parse state
matcher.lisp (file)
matcher.lisp (file)
Generate local vars for various template components
matcher.lisp (file)
Convert a string or symbol to a keyword symbol
matcher.lisp (file)
Verify that attrs1 is a proper subset of attrs2 under equalp of string form of names. Ignore variable attribute values
matcher.lisp (file)
Given an attribute template and the current node, when bindings exist and a variable occurs in the template attribute value position, add it to the bindings
matcher.lisp (file)
Bind the node to the variable including attributes if wanted
matcher.lisp (file)
Bind the body list to the variable in bindings
matcher.lisp (file)
If body is empty or has one element, return t
matcher.lisp (file)
bindings.lisp (file)
matcher.lisp (file)
Make a duplicate of the current state
matcher.lisp (file)
Current node is always first element of the body list (Invariant)
matcher.lisp (file)
List of tags from root to current
matcher.lisp (file)
Find all matching instances of a node in the tree
matcher.lisp (file)
Find a node and bind it and its attributes if provided
matcher.lisp (file)
Walk the child list of the parser until a match is found. If no more children, returns nil.
matcher.lisp (file)
Find the nth occurrence of tag and attributes from current state via next-node
matcher.lisp (file)
When we’re done with the body, return to prior path, popping as necessary
matcher.lisp (file)
Given a name, equalp match string forms of name and attribute names
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Make a new state object rooted at the current node
matcher.lisp (file)
matcher.lisp (file)
The initial state consists of a virtual body of which the current node is the top level node of the tree. We keep track of the root of the tree.
matcher.lisp (file)
Map fn across sequential applications of body-fns for the body list of the provided state. Moves state to end of child list and returns bindings if all match
matcher.lisp (file)
matcher.lisp (file)
Linear walk of the current child list, nil on end of list
matcher.lisp (file)
Depth first tree walker. Given the current state, update the state so that (first body) contains the next node in the tree. Returns the side effected state
matcher.lisp (file)
Match current node to tag and attributes
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Reset state to the initial state
matcher.lisp (file)
Is this tag a search variable?
matcher.lisp (file)
Modify state to make the first node of the current node’s body the current node and record the state of the current body variable to the path variable. When we pop, the next node is at the top, so we push the rest of the current body
matcher.lisp (file)
Modify state to make the current node the first
matcher.lisp (file)
matcher.lisp (file)
Return a symbol minus the leading character
matcher.lisp (file)
Return a symbol minus the leading character
matcher.lisp (file)
Return the base string by stripping the leading character
matcher.lisp (file)
Ensure that two tags are equal
matcher.lisp (file)
Same as tgen-merge-children but records the list of bindings from the body-fns to variable in a fresh bindings set
matcher.lisp (file)
Find a node by tag and attributes and bind via tgen-match. State points to the child node after the bound node
matcher.lisp (file)
Like tgen-find, but uses tgen-match-bind
matcher.lisp (file)
Try to match the current node to tag & attributes if body-fn is satisfied and return any bound attributes. Moves parse state to the next child node.
matcher.lisp (file)
Match node and add a reference to it to the bindings. Parse state is unchanged. Relies on tgen-match debug info
matcher.lisp (file)
Returns: result from calling function. Side effect: next-child
matcher.lisp (file)
Find the nth match for the provided state assuming body-fn moves the state to the next relevant node to test. Basically it’s a closure that when it’s called, recursively calls body-fn until counter hits zero and returns the last value of body-fn
matcher.lisp (file)
Returns: binding with variable matched to regex register result or nil
matcher.lisp (file)
Returns: t when it matches
matcher.lisp (file)
Matches anything and binds it to variable in a fresh binding. Returns: bindings
matcher.lisp (file)
Assumes the parse tree is looking at the first element of a tag body and that the body-fns are required sequential matches. Walks children until (current-node subtree) is null or all body-fns have been processed. Merges all the bindings returned from each body-fn. Each body-fn goes to next-child.
matcher.lisp (file)
matcher.lisp (file)
Identify matching variables by leading #?
matcher.lisp (file)
automatically generated reader method
bindings.lisp (file)
automatically generated writer method
bindings.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
matcher.lisp (file)
Recursively walk the template, generating nested matcher functions
matcher.lisp (file)
matcher.lisp (file)
structure-object (structure)
print-object (method)
parser-state-tree (function)
(setf parser-state-tree) (function)
parser-state-path (function)
(setf parser-state-path) (function)
parser-state-body (function)
(setf parser-state-body) (function)
• Concept index
• Function index
• Variable index
• Data type index