This is the langutils Reference Manual, version 1.0, generated automatically by Declt version 4.0 beta 2 "William Riker" on Sun Sep 15 04:09:06 2024 GMT+0.
The main system appears first, followed by any subsystem dependency.
langutils
Language utilities
Ian Eslick
BSD
1.0
s-xml-rpc
(system).
stdutils
(system).
src
(module).
Modules are listed depth-first from the system components tree.
langutils/src
langutils
(system).
package.lisp
(file).
config.lisp
(file).
tokens.lisp
(file).
reference.lisp
(file).
stopwords.lisp
(file).
my-meta.lisp
(file).
tokenize.lisp
(file).
lexicon.lisp
(file).
lemma.lisp
(file).
porter.lisp
(file).
contextual-rule-parser.lisp
(file).
tagger-data.lisp
(file).
tagger.lisp
(file).
chunker-constants.lisp
(file).
chunker.lisp
(file).
concept.lisp
(file).
init.lisp
(file).
Files are sorted by type and then listed depth-first from the systems components trees.
langutils/langutils.asd
langutils/src/package.lisp
langutils/src/config.lisp
langutils/src/tokens.lisp
langutils/src/reference.lisp
langutils/src/stopwords.lisp
langutils/src/my-meta.lisp
langutils/src/tokenize.lisp
langutils/src/lexicon.lisp
langutils/src/lemma.lisp
langutils/src/porter.lisp
langutils/src/contextual-rule-parser.lisp
langutils/src/tagger-data.lisp
langutils/src/tagger.lisp
langutils/src/chunker-constants.lisp
langutils/src/chunker.lisp
langutils/src/concept.lisp
langutils/src/init.lisp
langutils/src/config.lisp
package.lisp
(file).
src
(module).
*auto-init*
(special variable).
*config-paths*
(special variable).
*default-concise-stopwords-file*
(special variable).
*default-contextual-rule-file*
(special variable).
*default-lexical-rule-file*
(special variable).
*default-lexicon-file*
(special variable).
*default-stems-file*
(special variable).
*default-stopwords-file*
(special variable).
*default-token-map-file*
(special variable).
*report-status*
(special variable).
handle-config-entry
(function).
read-config
(function).
relative-pathname
(function).
write-log
(macro).
langutils/src/tokens.lisp
config.lisp
(file).
src
(module).
get-token-count
(function).
id-for-token
(function).
ids-for-tokens
(function).
string->token-array
(function).
suspicious-string?
(function).
suspicious-word?
(method).
token-for-id
(function).
tokens-for-ids
(function).
*add-to-map-hook*
(special variable).
*external-token-map*
(special variable).
*id-for-token-hook*
(special variable).
*id-table*
(special variable).
*max-token-nums*
(constant).
*max-token-others*
(constant).
*suspicious-words*
(special variable).
*token-counter*
(special variable).
*token-counter-hook*
(special variable).
*token-dirty-bit*
(special variable).
*token-for-id-hook*
(special variable).
*token-table*
(special variable).
*tokens-load-file*
(special variable).
*whitespace-chars*
(constant).
add-external-mapping
(function).
add-to-map-hook
(function).
ensure-token-counts
(function).
id-for-token-hook
(function).
ids-for-string
(function).
initialize-tokens
(function).
reset-token-counts
(function).
token-counter-hook
(function).
token-for-id-hook
(function).
langutils/src/reference.lisp
tokens.lisp
(file).
src
(module).
add-word
(method).
altered-phrase
(class).
change-word
(method).
change-word
(method).
document-annotations
(reader method).
(setf document-annotations)
(writer method).
document-tags
(reader method).
(setf document-tags)
(writer method).
document-text
(reader method).
(setf document-text)
(writer method).
find-phrase
(method).
find-phrase-intervals
(method).
find-phrase-intervals
(method).
get-annotation
(method).
get-annotation
(method).
get-tag
(method).
get-tag
(method).
get-tag
(method).
get-token-id
(method).
get-token-id
(method).
get-token-id
(method).
lemmatize-phrase
(method).
lemmatize-phrase
(method).
length-of
(method).
make-alterable-phrase
(method).
make-phrase
(function).
make-phrase-from-sentence
(function).
make-phrase-from-vdoc
(function).
make-vector-document
(function).
phrase
(class).
phrase->string
(method).
phrase->token-array
(method).
phrase-distance
(method).
phrase-document
(method).
phrase-document
(reader method).
(setf phrase-document)
(writer method).
phrase-end
(method).
phrase-end
(reader method).
(setf phrase-end)
(writer method).
phrase-equal
(method).
phrase-lemmas
(method).
phrase-length
(method).
phrase-length
(method).
phrase-overlap
(method).
phrase-start
(method).
phrase-start
(reader method).
(setf phrase-start)
(writer method).
phrase-type
(reader method).
(setf phrase-type)
(writer method).
phrase-words
(function).
print-object
(method).
print-phrase
(method).
print-phrase-lemmas
(method).
print-vector-document
(method).
print-window
(method).
read-vector-document
(method).
read-vector-document-to-string
(method).
remove-word
(method).
remove-word
(method).
set-annotation
(method).
set-annotation
(method).
string-tag
(function).
string-tag-tokenized
(function).
unset-annotation
(method).
unset-annotation
(method).
vector-document
(function).
vector-document
(class).
vector-document-string
(method).
vector-document-words
(method).
write-vector-document
(method).
*temp-phrase*
(special variable).
*test*
(special variable).
altered-phrase-custom-document
(reader method).
(setf altered-phrase-custom-document)
(writer method).
copy-phrase
(method).
document-window-as-string
(method).
make-document-from-phrase
(method).
person-token-offset
(function).
phrase-annotations
(reader method).
(setf phrase-annotations)
(writer method).
print-token-array
(function).
temp-phrase
(function).
token-array->words
(function).
vector-doc-as-ids
(method).
vector-doc-as-words
(method).
langutils/src/stopwords.lisp
reference.lisp
(file).
src
(module).
concise-stopword?
(function).
contains-is?
(function).
stopword?
(function).
string-concise-stopword?
(function).
string-contains-is?
(function).
string-stopword?
(function).
*concise-stopwords*
(special variable).
*is-token*
(special variable).
*s-token*
(special variable).
*stopwords*
(special variable).
clean-stopwords
(function).
init-concise-stopwords
(function).
init-stopwords
(function).
init-word-test
(function).
langutils/src/my-meta.lisp
stopwords.lisp
(file).
src
(module).
disable-meta-syntax
(function).
enable-meta-syntax
(function).
print-object
(method).
with-list-meta
(macro).
with-stream-meta
(macro).
with-string-meta
(macro).
*meta-readtable*
(special variable).
*saved-readtable*
(special variable).
compile-list
(function).
compileit
(function).
copy-meta
(function).
list-match
(macro).
list-match-type
(macro).
make-meta
(function).
meta
(structure).
meta-char
(reader).
(setf meta-char)
(writer).
meta-form
(reader).
(setf meta-form)
(writer).
meta-p
(function).
meta-reader
(function).
stream-match
(macro).
stream-match-type
(macro).
string-match
(macro).
string-match-type
(macro).
symbol-name-equal
(function).
langutils/src/tokenize.lisp
my-meta.lisp
(file).
src
(module).
tokenize-stream
(function).
tokenize-string
(function).
alpha
(type).
alpha-lower
(type).
alpha-lowercase
(function).
alpha-misc
(function).
alpha-upper
(type).
alpha-uppercase
(function).
alphanum
(type).
digit
(type).
end-of-sentence
(condition).
known-abbreviations
(special variable).
non-digit
(type).
non-digit-or-ws
(type).
non-punc-or-white
(type).
non-whitespace
(type).
punctuation
(type).
tokenize-file2
(function).
whitespace
(type).
langutils/src/lexicon.lisp
tokenize.lisp
(file).
src
(module).
get-lexicon-case-forms
(function).
get-lexicon-default-pos
(function).
(setf get-lexicon-entry)
(setf expander).
get-lexicon-entry
(function).
lexicon-entry
(structure).
lexicon-entry-id
(reader).
(setf lexicon-entry-id)
(writer).
lexicon-entry-roots
(reader).
(setf lexicon-entry-roots)
(writer).
lexicon-entry-surface-forms
(reader).
(setf lexicon-entry-surface-forms)
(writer).
lexicon-entry-tag
(function).
lexicon-entry-tags
(reader).
(setf lexicon-entry-tags)
(writer).
*lexicon*
(special variable).
add-basic-entry
(function).
add-root
(function).
add-root-forms
(function).
add-roots
(function).
add-surface-form
(function).
add-unknown-lexicon-entry
(function).
clean-lexicon
(function).
copy-lexicon-entry
(function).
ensure-lexicon-entry
(function).
init-lexicon
(function).
lexicon-entry-case-forms
(reader).
(setf lexicon-entry-case-forms)
(writer).
lexicon-entry-p
(function).
make-cases
(function).
make-lexicon-entry
(function).
set-lexicon-entry
(function).
with-static-memory-allocation
(macro).
langutils/src/lemma.lisp
lexicon.lisp
(file).
src
(module).
get-lemma
(function).
get-lemma-for-id
(function).
in-pos-class?
(function).
lemmatize
(method).
lemmatize
(method).
morph-case-surface-forms
(function).
morph-surface-forms
(function).
morph-surface-forms-text
(function).
*get-determiners*
(function).
*pos-class-map*
(special variable).
select-token
(function).
langutils/src/porter.lisp
lemma.lisp
(file).
src
(module).
langutils/src/contextual-rule-parser.lisp
porter.lisp
(file).
src
(module).
*contextual-rule-args*
(special variable).
def-contextual-rule-parser
(macro).
gen-rule-arg-bindings
(function).
gen-rule-arg-decls
(function).
gen-rule-closure
(function).
gen-rule-closure-decl
(function).
gen-rule-match
(function).
get-bind-entry
(function).
langutils/src/tagger-data.lisp
contextual-rule-parser.lisp
(file).
src
(module).
apply-rules
(function).
guess-tag
(function).
load-contextual-rules
(function).
load-lexical-rules
(function).
make-contextual-rule
(function).
make-lexical-rule
(function).
langutils/src/tagger.lisp
tagger-data.lisp
(file).
src
(module).
clean-tagger
(function).
init-tagger
(function).
initial-tag
(function).
read-and-tag-file
(function).
read-file-as-tagged-document
(function).
tag
(function).
tag-tokenized
(function).
vector-tag
(function).
vector-tag-tokenized
(function).
*tagger-bigrams*
(special variable).
*tagger-contextual-rules*
(special variable).
*tagger-lexical-rules*
(special variable).
*tagger-wordlist*
(special variable).
apply-contextual-rules
(function).
default-tag
(function).
duplicate-from
(function).
load-tagger-files
(function).
read-file-to-string
(function).
return-vector-doc
(function).
test-vector-tag-tokenized
(function).
write-temp
(function).
langutils/src/chunker-constants.lisp
tagger.lisp
(file).
src
(module).
adv-pattern
(constant).
noun-pattern
(constant).
p-pattern
(constant).
verb-pattern
(constant).
langutils/src/chunker.lisp
chunker-constants.lisp
(file).
src
(module).
chunk
(function).
chunk-tokenized
(function).
get-adverb-chunks
(method).
get-event-chunks
(method).
get-extended-event-chunks1
(method).
get-extended-event-chunks2
(method).
get-imperative-chunks
(method).
get-nx-chunks
(method).
get-p-chunks
(method).
get-pp-chunks
(method).
get-vx-chunks
(method).
head-verb
(function).
head-verbs
(function).
root-noun
(function).
root-nouns
(function).
*common-verbs*
(special variable).
all-vx+nx-phrases
(function).
ensure-common-verbs
(function).
get-basic-chunks
(method).
test-phrase
(function).
langutils/src/concept.lisp
chunker.lisp
(file).
src
(module).
associate-concepts
(function).
concat-concepts
(method).
concept->string
(method).
concept->token-array
(method).
concept->words
(method).
concept-contains
(method).
conceptually-equal
(method).
conceptually-equal
(method).
conceptually-equal
(method).
conceptually-equal
(method).
force-concept
(function).
make-concept
(function).
phrase->concept
(function).
print-object
(method).
string->concept
(function).
token-array->concept
(function).
token-vector
(reader method).
words->concept
(function).
*concept-store-scratch-array*
(special variable).
*concept-vhash*
(special variable).
clear-concept-cache
(method).
concept
(class).
ensure-concept
(function).
lookup-canonical-concept-instance
(method).
lookup-canonical-concept-instance
(method).
register-new-concept-instance
(method).
test-concept-equality
(function).
langutils/src/init.lisp
concept.lisp
(file).
src
(module).
clean-langutils
(function).
init-langutils
(function).
reset-langutils
(function).
Packages are listed by definition order.
my-meta
common-lisp
.
disable-meta-syntax
(function).
enable-meta-syntax
(function).
with-list-meta
(macro).
with-stream-meta
(macro).
with-string-meta
(macro).
*meta-readtable*
(special variable).
*saved-readtable*
(special variable).
compile-list
(function).
compileit
(function).
copy-meta
(function).
list-match
(macro).
list-match-type
(macro).
make-meta
(function).
meta
(structure).
meta-char
(reader).
(setf meta-char)
(writer).
meta-form
(reader).
(setf meta-form)
(writer).
meta-p
(function).
meta-reader
(function).
stream-match
(macro).
stream-match-type
(macro).
string-match
(macro).
string-match-type
(macro).
symbol-name-equal
(function).
langutils
common-lisp
.
stdutils
.
add-word
(generic function).
altered-phrase
(class).
associate-concepts
(function).
change-word
(generic function).
chunk
(function).
chunk-tokenized
(function).
clean-langutils
(function).
clean-tagger
(function).
concat-concepts
(generic function).
concept->string
(generic function).
concept->token-array
(generic function).
concept->words
(generic function).
concept-contains
(generic function).
conceptually-equal
(generic function).
concise-stopword?
(function).
contains-is?
(function).
document-annotations
(generic reader).
(setf document-annotations)
(generic writer).
document-tags
(generic reader).
(setf document-tags)
(generic writer).
document-text
(generic reader).
(setf document-text)
(generic writer).
find-phrase
(generic function).
find-phrase-intervals
(generic function).
force-concept
(function).
get-adverb-chunks
(generic function).
get-annotation
(generic function).
get-event-chunks
(generic function).
get-extended-event-chunks1
(generic function).
get-extended-event-chunks2
(generic function).
get-imperative-chunks
(generic function).
get-lemma
(function).
get-lemma-for-id
(function).
get-lexicon-case-forms
(function).
get-lexicon-default-pos
(function).
(setf get-lexicon-entry)
(setf expander).
get-lexicon-entry
(function).
get-nx-chunks
(generic function).
get-p-chunks
(generic function).
get-pp-chunks
(generic function).
get-tag
(generic function).
get-token-count
(function).
get-token-id
(generic function).
get-vx-chunks
(generic function).
head-verb
(function).
head-verbs
(function).
id-for-token
(function).
ids-for-tokens
(function).
in-pos-class?
(function).
init-langutils
(function).
init-tagger
(function).
initial-tag
(function).
lemmatize
(generic function).
lemmatize-phrase
(generic function).
length-of
(generic function).
lexicon-entry
(structure).
lexicon-entry-id
(reader).
(setf lexicon-entry-id)
(writer).
lexicon-entry-roots
(reader).
(setf lexicon-entry-roots)
(writer).
lexicon-entry-surface-forms
(reader).
(setf lexicon-entry-surface-forms)
(writer).
lexicon-entry-tag
(function).
lexicon-entry-tags
(reader).
(setf lexicon-entry-tags)
(writer).
make-alterable-phrase
(generic function).
make-concept
(function).
make-phrase
(function).
make-phrase-from-sentence
(function).
make-phrase-from-vdoc
(function).
make-vector-document
(function).
morph-case-surface-forms
(function).
morph-surface-forms
(function).
morph-surface-forms-text
(function).
phrase
(class).
phrase->concept
(function).
phrase->string
(generic function).
phrase->token-array
(generic function).
phrase-distance
(generic function).
phrase-document
(generic function).
(setf phrase-document)
(generic writer).
phrase-end
(generic function).
(setf phrase-end)
(generic writer).
phrase-equal
(generic function).
phrase-lemmas
(generic function).
phrase-length
(generic function).
phrase-overlap
(generic function).
phrase-start
(generic function).
(setf phrase-start)
(generic writer).
phrase-type
(generic reader).
(setf phrase-type)
(generic writer).
phrase-words
(function).
print-phrase
(generic function).
print-phrase-lemmas
(generic function).
print-vector-document
(generic function).
print-window
(generic function).
read-and-tag-file
(function).
read-file-as-tagged-document
(function).
read-vector-document
(generic function).
read-vector-document-to-string
(generic function).
remove-word
(generic function).
reset-langutils
(function).
root-noun
(function).
root-nouns
(function).
set-annotation
(generic function).
stopword?
(function).
string->concept
(function).
string->token-array
(function).
string-concise-stopword?
(function).
string-contains-is?
(function).
string-stopword?
(function).
string-tag
(function).
string-tag-tokenized
(function).
suspicious-string?
(function).
suspicious-word?
(generic function).
tag
(function).
tag-tokenized
(function).
token-array->concept
(function).
token-for-id
(function).
token-vector
(generic reader).
tokens-for-ids
(function).
unset-annotation
(generic function).
vector-document
(function).
vector-document
(class).
vector-document-string
(generic function).
vector-document-words
(generic function).
vector-tag
(function).
vector-tag-tokenized
(function).
words->concept
(function).
write-vector-document
(generic function).
*add-to-map-hook*
(special variable).
*auto-init*
(special variable).
*common-verbs*
(special variable).
*concept-store-scratch-array*
(special variable).
*concept-vhash*
(special variable).
*concise-stopwords*
(special variable).
*config-paths*
(special variable).
*contextual-rule-args*
(special variable).
*default-concise-stopwords-file*
(special variable).
*default-contextual-rule-file*
(special variable).
*default-lexical-rule-file*
(special variable).
*default-lexicon-file*
(special variable).
*default-stems-file*
(special variable).
*default-stopwords-file*
(special variable).
*default-token-map-file*
(special variable).
*external-token-map*
(special variable).
*get-determiners*
(function).
*id-for-token-hook*
(special variable).
*id-table*
(special variable).
*is-token*
(special variable).
*lexicon*
(special variable).
*max-token-nums*
(constant).
*max-token-others*
(constant).
*pos-class-map*
(special variable).
*report-status*
(special variable).
*s-token*
(special variable).
*stopwords*
(special variable).
*suspicious-words*
(special variable).
*tagger-bigrams*
(special variable).
*tagger-contextual-rules*
(special variable).
*tagger-lexical-rules*
(special variable).
*tagger-wordlist*
(special variable).
*temp-phrase*
(special variable).
*test*
(special variable).
*token-counter*
(special variable).
*token-counter-hook*
(special variable).
*token-dirty-bit*
(special variable).
*token-for-id-hook*
(special variable).
*token-table*
(special variable).
*tokens-load-file*
(special variable).
*whitespace-chars*
(constant).
add-basic-entry
(function).
add-external-mapping
(function).
add-root
(function).
add-root-forms
(function).
add-roots
(function).
add-surface-form
(function).
add-to-map-hook
(function).
add-unknown-lexicon-entry
(function).
adv-pattern
(constant).
all-vx+nx-phrases
(function).
altered-phrase-custom-document
(generic reader).
(setf altered-phrase-custom-document)
(generic writer).
apply-contextual-rules
(function).
apply-rules
(function).
clean-lexicon
(function).
clean-stopwords
(function).
clear-concept-cache
(generic function).
concept
(class).
consonantp
(function).
copy-lexicon-entry
(function).
copy-phrase
(generic function).
cvc
(function).
def-contextual-rule-parser
(macro).
default-tag
(function).
document-window-as-string
(generic function).
doublec
(function).
duplicate-from
(function).
ends
(function).
ensure-common-verbs
(function).
ensure-concept
(function).
ensure-lexicon-entry
(function).
ensure-token-counts
(function).
gen-rule-arg-bindings
(function).
gen-rule-arg-decls
(function).
gen-rule-closure
(function).
gen-rule-closure-decl
(function).
gen-rule-match
(function).
get-basic-chunks
(generic function).
get-bind-entry
(function).
guess-tag
(function).
handle-config-entry
(function).
id-for-token-hook
(function).
ids-for-string
(function).
init-concise-stopwords
(function).
init-lexicon
(function).
init-stopwords
(function).
init-word-test
(function).
initialize-tokens
(function).
lexicon-entry-case-forms
(reader).
(setf lexicon-entry-case-forms)
(writer).
lexicon-entry-p
(function).
load-contextual-rules
(function).
load-lexical-rules
(function).
load-tagger-files
(function).
lookup-canonical-concept-instance
(generic function).
m
(function).
make-cases
(function).
make-contextual-rule
(function).
make-document-from-phrase
(generic function).
make-lexical-rule
(function).
make-lexicon-entry
(function).
noun-pattern
(constant).
p-pattern
(constant).
person-token-offset
(function).
phrase-annotations
(generic reader).
(setf phrase-annotations)
(generic writer).
print-token-array
(function).
r
(function).
read-config
(function).
read-file-to-string
(function).
register-new-concept-instance
(generic function).
relative-pathname
(function).
reset-token-counts
(function).
return-vector-doc
(function).
select-token
(function).
set-lexicon-entry
(function).
setto
(function).
stem
(function).
step1ab
(function).
step1c
(function).
step2
(function).
step3
(function).
step4
(function).
step5
(function).
temp-phrase
(function).
test-concept-equality
(function).
test-phrase
(function).
test-vector-tag-tokenized
(function).
token-array->words
(function).
token-counter-hook
(function).
token-for-id-hook
(function).
vector-doc-as-ids
(generic function).
vector-doc-as-words
(generic function).
verb-pattern
(constant).
vowelinstem
(function).
with-static-memory-allocation
(macro).
write-log
(macro).
write-temp
(function).
langutils-tokenize
common-lisp
.
my-meta
.
tokenize-stream
(function).
tokenize-string
(function).
alpha
(type).
alpha-lower
(type).
alpha-lowercase
(function).
alpha-misc
(function).
alpha-upper
(type).
alpha-uppercase
(function).
alphanum
(type).
digit
(type).
end-of-sentence
(condition).
known-abbreviations
(special variable).
non-digit
(type).
non-digit-or-ws
(type).
non-punc-or-white
(type).
non-whitespace
(type).
punctuation
(type).
tokenize-file2
(function).
whitespace
(type).
Definitions are sorted by export status, category, package, and then by lexicographic order.
get-lexicon-entry
(function).
set-lexicon-entry
(function).
Return the list of phrase/list/token-arrays as pairs with the first element being the original and the second being a canonicalized concept instance
Returns a phrase-list for the provided text
Returns a phrase-list for the provided tokenized string
Identifies id as a ’concise-stopword’ word.
Concise stopwords are a *very* small list of words, mainly pronouns and determiners.
Tests list of ids for ’is’ words
Provides the root word string for the provided word string
Returns a lemma id for the provided word id. When pos is given, only the root for that part-of-speech type is returned and the noun argument is ignored. By default, noun stems nouns to their singular form; porter determines whether the Porter algorithm is used for unknown terms.
Return the current token counter
This takes string ’tokens’ and returns a unique id for that character sequence - beware of whitespace, etc.
Return an initial tag for a given token string using the langutils lexicon and the tagger lexical rules (via guess-tag)
id
.
tags
.
Take two arrays of text and tags and create a phrase that points at a vdoc created from the two arrays
All cases of morphological surface forms of the provided root
Takes a word or id and returns all surface form ids, or all forms of class ’pos-class’, where pos-class is one of the symbols langutils::V, langutils::A, or langutils::N
Create a canonical concept from an arbitrary phrase by removing determiners and lemmatizing verbs.
Identifies id as a ’stopword’
Checks whether the word is a ’concise-stopword’ word.
Concise stopwords are a *very* small list of words, mainly pronouns and determiners.
Checks the list for a string containing ’is’
Tokenizes and tags the string, returning
a standard tagged string using ’/’ as a separator
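The slash-separated output format described above can be illustrated with a small, self-contained sketch. The helper `sketch-tagged-string` is hypothetical and is not part of langutils; it only shows the shape of the resulting string, assuming tokens and tags arrive as two parallel vectors.

```lisp
;; Hypothetical helper, not part of langutils: render parallel token and tag
;; vectors in the word/TAG format. Tags may be strings or symbols, because
;; STRING accepts both.
(defun sketch-tagged-string (tokens tags)
  "Join TOKENS and TAGS pairwise as \"token/TAG\", separated by spaces."
  (with-output-to-string (out)
    (loop for token across tokens
          for tag across tags
          for first = t then nil
          unless first do (write-char #\Space out)
          do (format out "~A/~A" token (string tag)))))

;; Example call with made-up tokens and Penn-style tags:
(sketch-tagged-string #("The" "cat" "sat") #("DT" "NN" "VBD"))
;; => "The/DT cat/NN sat/VBD"
```

The real string-tag function also tokenizes and tags its input; this sketch covers only the final formatting step.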
Determine if the alpha-num and number balance is reasonable for linguistic processing or if non-alpha-nums are present
Return a string token for a given token id
Converts a stream into a string and tokenizes, optionally, one sentence
at a time which is nice for large files. Pretty hairy code: a token
processor inside a stream scanner. The stream scanner walks the input stream
and tokenizes all punctuation (except periods). After a sequence of
non-whitespace has been read, the inline tokenizer looks at the end of the
string for mis-tokenized words (can ’ t -> ca n’t)
Returns a fresh, linguistically tokenized string
Return a list of string tokens for each id in ids
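The token functions described above (a unique id per character sequence, a string per id, and a list of strings for a list of ids) amount to a two-way interning table. A minimal sketch of that idea in plain Common Lisp follows. All names prefixed with `sketch-` are hypothetical, and the hash-table layout is an assumption for illustration rather than the langutils implementation.

```lisp
;; Hypothetical sketch of a two-way token table; not the langutils internals.
(defvar *sketch-ids* (make-hash-table :test #'equal)
  "Maps token strings to integer ids.")
(defvar *sketch-tokens* (make-hash-table :test #'eql)
  "Maps integer ids back to token strings.")
(defvar *sketch-counter* 0
  "Most recently allocated id.")

(defun sketch-id-for-token (token)
  "Return the id for TOKEN, allocating a new one on first sight.
The lookup is exact, so \"cat\" and \"cat \" receive different ids."
  (or (gethash token *sketch-ids*)
      (let ((id (incf *sketch-counter*)))
        (setf (gethash id *sketch-tokens*) token
              (gethash token *sketch-ids*) id))))

(defun sketch-token-for-id (id)
  "Return the string registered for ID, or NIL if ID is unknown."
  (values (gethash id *sketch-tokens*)))

(defun sketch-tokens-for-ids (ids)
  "Return a list of token strings, one per id in IDS."
  (mapcar #'sketch-token-for-id ids))
```

Because the table keys on the exact character sequence, stray whitespace produces a distinct id, which is the pitfall the id-for-token description warns about.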
Returns a ’document’ which is a class containing a pair of vectors representing the string in the internal token format. Handles arbitrary data.
Returns a document representing the string using the
internal token dictionary; requires the string to be tokenized.
Parses the string into tokens (whitespace separators) then populates
the two temp arrays above with token ids and initial tags. Contextual
rules are applied and a new vector document is produced which
is a copy of the enclosed data. This is all done at once so good
compilers can open-code the array refs and simplify the calling
of the labels functions.
Method (altered-phrase) index word tag
Method (altered-phrase) index new-token &optional new-pos
Method (vector-document)
automatically generated reader method
Method (vector-document)
automatically generated writer method
Method (vector-document)
automatically generated reader method
Target slot: tags.
Method (vector-document)
automatically generated writer method
Target slot: tags.
Method (vector-document)
automatically generated reader method
Target slot: text.
Method (vector-document)
automatically generated writer method
Target slot: text.
Method (phrase) (doc vector-document) &key match start ignore-start ignore-end lemma concept-terms
Find the specified phrase starting at start, matching text and/or tags according to match. The lemma parameter indicates whether the phrases match under the lemma operator; ignore-start and ignore-end cause the search to skip a region within the document
Method (array) (doc vector-document) &key match start lemma concept-terms
Find all phrase intervals in the vector document
Method (phrase) (doc vector-document) &key match start lemma concept-terms
Find all phrase intervals in the vector document
Method (vector-document) &optional interval
Return a list of all adverbial phrases
Method (phrase) key
First returned value is the association value or null if none. The second is true if the key exists, nil otherwise
Method (vector-document) key
First returned value is the association value or null if none. The second is true if the key exists, nil otherwise
Method (vector-document) &optional interval
Return vx+nx (simple verb arg) phrase objects
Method (vector-document) &optional interval
Return vx+nx+pp... objects
Method (vector-document) &optional interval
Return vx+nx+pp... objects
Method (vector-document) &optional interval
Method (vector-document) &optional interval
Return a list of all nx phrases
Method (vector-document) &optional interval
Return a list of all prepositions as phrases
Method (vector-document) &optional interval
Return a list of all prepositions as phrases
Method (altered-phrase) index
Method (vector-document) offset
Method (altered-phrase) index
Method (vector-document) offset
Method (vector-document) &optional interval
Return a list of all primitive vx phrases - no arguments
Method (altered-phrase) &optional offset
Destructive lemmatization of a phrase
Method (vector-document)
Method (altered-phrase)
Method (altered-phrase)
Method (altered-phrase)
Method (altered-phrase)
Method (vector-document) &key stream with-tags with-newline
Method (vector-document) &key with-tags
Method (altered-phrase) index
Method (phrase) key value &key method
Add an annotation to object using method :override, :push, :duplicate-key
Method (vector-document) key value &key method
Add an annotation to object using method :override, :push, :duplicate-key
Method (fixnum)
Find a suspicious word using its token id
Method (vector-document) key
Method (vector-document) &key with-tags with-newline
Method (vector-document)
Method (vector-document) filename &key with-tags if-exists
:custom-document
change-word
.
conceptually-equal
.
conceptually-equal
.
conceptually-equal
.
copy-phrase
.
find-phrase
.
find-phrase-intervals
.
get-annotation
.
get-tag
.
get-token-id
.
lemmatize-phrase
.
make-alterable-phrase
.
make-document-from-phrase
.
phrase->string
.
phrase->token-array
.
(setf phrase-annotations)
.
phrase-annotations
.
phrase-distance
.
(setf phrase-document)
.
phrase-document
.
(setf phrase-end)
.
phrase-end
.
phrase-equal
.
phrase-lemmas
.
phrase-length
.
phrase-overlap
.
(setf phrase-start)
.
phrase-start
.
(setf phrase-type)
.
phrase-type
.
print-object
.
print-phrase
.
print-phrase-lemmas
.
print-window
.
remove-word
.
set-annotation
.
unset-annotation
.
common-lisp
.
:type
:document
:start
:end
:annotations
(setf document-annotations)
.
document-annotations
.
(setf document-tags)
.
document-tags
.
(setf document-text)
.
document-text
.
find-phrase
.
find-phrase-intervals
.
find-phrase-intervals
.
get-adverb-chunks
.
get-annotation
.
get-basic-chunks
.
get-event-chunks
.
get-extended-event-chunks1
.
get-extended-event-chunks2
.
get-imperative-chunks
.
get-nx-chunks
.
get-p-chunks
.
get-pp-chunks
.
get-tag
.
get-token-id
.
get-vx-chunks
.
length-of
.
print-vector-document
.
read-vector-document-to-string
.
set-annotation
.
unset-annotation
.
vector-doc-as-ids
.
vector-doc-as-words
.
vector-document-string
.
vector-document-words
.
write-vector-document
.
(array fixnum)
:text
(array symbol)
:tags
list
:annotations
The maximum number of numbers allowed in a valid token
The maximum number of non-alphanumeric characters in a valid token
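The two limits above suggest a simple character-counting check of the kind suspicious-string? is described as performing. The following is only a sketch under assumptions: the function name is hypothetical, and the default thresholds of 5 and 3 are invented for illustration, since the actual constant values are not given here.

```lisp
;; Hypothetical sketch; thresholds are illustrative, not the langutils values.
(defun sketch-suspicious-string-p (string &key (max-digits 5) (max-others 3))
  "Return true when STRING holds more than MAX-DIGITS digit characters
or more than MAX-OTHERS characters that are neither letters nor digits."
  (let ((digits (count-if #'digit-char-p string))
        (others (count-if-not #'alphanumericp string)))
    (or (> digits max-digits)
        (> others max-others))))

(sketch-suspicious-string-p "hello")         ; ordinary word, not flagged
(sketch-suspicious-string-p "a1b2c3d4e5f6")  ; six digits, flagged
```

Counting characters rather than parsing keeps the check cheap enough to run on every token.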
Whether to call initialize-langutils when the .fasl is loaded
Allows us to lookup concepts from arrays without allocating lots of unnecessary data
The templates for parsing contextual rules and constructing matching templates over word/pos arrays
Path to a *very* small list of words, mainly pronouns and determiners
Path to the brill contextual rule file
Path to the brill lexical rule file
Path to the lexicon file
Path to the word stems file
Path to a stopwords file
Path to the token map file
Where to print langutils messages; defaults to none
Memoize known suspicious words that have been tokenized in this hash
Bigram hash (not implemented yet)
Table to hold the contextual rule closures
Table to hold the lexical rule closures
Wordlist hash (not implemented yet)
Given a list of structures, defines a generator named ’name’ that takes
a Brill contextual rule list (list of strings) and generates an applicable
closure. The closure accepts an argument list of (tokens tags offset) and will
apply the rule and related side effect to the two arrays at the provided
offset. Patterns are to be given in the form:
("SURROUNDTAG" (match (0 oldtag) (-1 tag1) (+1 tag2)) =>
(setf oldtag newtag))
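To make the pattern above concrete, here is a hand-written sketch of the kind of closure such a rule could expand into. It is an assumption-laden illustration, not the code the macro generates: the constructor name `make-surroundtag-rule` is hypothetical, and tags are assumed to be symbols held in a vector.

```lisp
;; Hypothetical sketch of one SURROUNDTAG-style rule as a closure.
(defun make-surroundtag-rule (oldtag newtag tag1 tag2)
  "Return a closure of (TOKENS TAGS OFFSET) that rewrites OLDTAG to NEWTAG
at OFFSET when the previous tag is TAG1 and the next tag is TAG2."
  (lambda (tokens tags offset)
    (declare (ignore tokens))                   ; this rule inspects tags only
    (when (and (< 0 offset (1- (length tags)))  ; both neighbours must exist
               (eq (aref tags offset) oldtag)
               (eq (aref tags (1- offset)) tag1)
               (eq (aref tags (1+ offset)) tag2))
      (setf (aref tags offset) newtag))))

;; Example: retag VB as NN when it sits between DT and IN.
(let ((tags (vector 'dt 'vb 'in))
      (rule (make-surroundtag-rule 'vb 'nn 'dt 'in)))
  (funcall rule #("the" "run" "of") tags 1)
  tags)
;; => #(DT NN IN)
```

The closure mutates the tag array in place, matching the description of a rule applying its side effect at the provided offset.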
Add a word and its probability-ordered tags to the lexicon
Add a root form to word if it does not already exist
Set the root list (pairs of pos_type/root) for the entry for ’word’
Add a surface form to a root word
Overly hairy function for finding all vx phrases that are followed by nx. get-event-chunks is a better way to do this.
Return T if the given character is an alpha character
Apply rules to the values in values presuming that the returned list is also a list of values that can be passed to the next rule
Simple default tagging based on capitalization of token string
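A capitalization-based default can be sketched in a few lines. The tag choices below (:NNP for capitalized tokens, :NN otherwise) follow the common convention for Brill-style taggers and are an assumption here, as is the function name; the actual default-tag may differ.

```lisp
;; Hypothetical sketch; the :NNP / :NN choice is an assumed convention.
(defun sketch-default-tag (token)
  "Guess a tag for the string TOKEN from its first character:
proper noun if it starts with an upper-case letter, common noun otherwise."
  (if (and (plusp (length token))
           (upper-case-p (char token 0)))
      :NNP
      :NN))

(sketch-default-tag "London")  ; => :NNP
(sketch-default-tag "city")    ; => :NN
```

A default like this is only a starting guess; lexical and contextual rules are expected to refine it afterwards.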
Reset token count if not already set
Generate let bindings for the args referenced in the match pattern
Generate type declarations for canonical variables from table entry
Generate the code for the rule closure as one of the cond forms matching the name of the closure pattern to the rule pattern
Optimize the compiled closure through type and optimization declarations
Generate the conditional code to match this rule
Given a canonical variable name, create its let binding and extraction expression from the rule file entry
Using rules in rule-table guess the tag of the token ’token’
Populates the lexicon with ’word tag1 tag2’ structured lines from lexicon-file
Return a list of closures implementing the lexical rules in rule-file to tag words not found in the lexicon
Look through list for rule name
char
.
form
.
Reset all the token data structures to an initialized but empty state
Internal per-token function
Prints all the phrases found in the text for simple experimenting
Tokenizes a pure text file a sentence at a time
Method (altered-phrase)
automatically generated reader method
Method (altered-phrase)
automatically generated writer method
Method (vector-document) &optional interval
Returns a list of PHRASEs referencing ’doc’ for all supported primitive phrase types
Method (vector-document)
Converts the word array to ids with shared structure
for the other elements; keeps the data ’in the family’
so the source or destination documents should be short lived
Method (vector-document)
Stores the representation of the concept as an array of token ids
(array fixnum)
:token-vector
This slot is read-only.