Language Modelling NetLogo Model
Produced for the book series "Artificial Intelligence";
Author: W. J. Teahan; Publisher: Ventus Publishing Aps, Denmark.
powered by NetLogo
view/download model file: Language-Modelling.nlogo
WHAT IS IT?
This model shows how a language model can be constructed from some training text. The type of language model is a fixed order Markov model - that is, according to the Markov property, the probability distribution for the next step and future steps depends only on the current state of the system and does not take into consideration the system's previous state. The language model is also fixed order - that is, the probability is conditioned on a fixed length prior context.
Note: There is some possible confusion of the term "model" for this NetLogo "model". NetLogo uses the term "model" to refer to a program/simulation. This can be confused with the use of the term "model" in the phrase "language model" which is a mechanism for assigning a probability to a sequence of symbols by means of a probability distribution. In the information listed below, the two types of models will be explicitly distinguished by using the phrases "NetLogo model" and "language model".
WHAT IS ITS PURPOSE?
The purpose of this NetLogo model is to show various important features of language models and to visualise them using NetLogo link and turtle agents.
HOW IT WORKS
The data being modelled is conceptualised as a stream of events (i.e. analogously to Event Stream Processing or ESP) where events occur with a specific order for a specific stream, but at the same time can occur simultaneously on separate streams.
A training text sequence is used to train the language model. This language model is then used to process a test text sequence. Each stream event from the training data is represented by a state turtle agent, and paths between states are represented by path link agents. When the language model is used for an application, such as processing a test sequence of text, then walker turtle agents are used to walk the network of states and links to show at what point the current contexts are active. Text turtle agents are used for maintaining information about a particular text such as sums and a variable for storing the language model.
HOW TO USE IT
Press the setup button in the Interface to reset all agents. To load a language model, first select which text you wish to load from the which-text chooser, then press the load-model-from-text button. This will load the language model into the environment so that it can be visualised using the layout specified by the layout-type chooser. If you have selected a spring layout-type, then pressing the change-layout button can sometimes remove some of the clutter once it has been loaded. Other language models from different texts can be subsequently loaded if needed. The pick-these-streams and pick-these-events input boxes can be used to filter out the language model so that only the part of the language model that matches the specific streams and events are visualised. The text-event chooser specifies what type of event should be loaded. The NetLogo model at the moment only processes character and word events.
To see how the language model(s) process a testing text sequence, first edit the test-text input box and insert the test text sequence that you would like to be processed, then press the go-walkers button. The paths taken through the language model(s) will then be highlighted as the testing text sequence is processed sequentially.
THE INTERFACE
The NetLogo model's Interface buttons are defined as follows:
- setup: This (re-) initialises all the variables.
- change-layout: This changes the layout to move the agents apart if the layout-type slider is set to "spring".
- go-walkers: If a language model is visualised in the environment, then this will run an animation that shows how the testing sequence (specified within the test-text input box) is processed using the language model.
- kill-all-walkers: This kills all current walkers.
- load-model-from-text: This will load a language model using the training text specified by the which-text chooser. The type of events (characters or words) is specified by the text-event chooser.
The NetLogo model's Interface sliders, and choosers are defined as follows:
- layout-type: This specifies the type of layout to use for the visualisation - either spring or radial.
- which-text: This specifies the training text to use to train the language models.
- text-event: This specifies the type of event in the training text being processed - either characters or words.
- max-tree-depth: This specifies the maximum tree depth of the language model (the order of the language model is equal to the max-tree-depth + 1).
- max-text-size: This specifies the maximum number of characters to be processed from the training text. A relatively low number is recommended, otherwise there will be too much to visualise.
- window-size: This sets the window size over which the NetLogo model tries to classify the text using a crude form of text categorisation. In this case, the activity count of the language models is maintained (where an activity is the highlighting of a crossed link during the animation), and the most active language model is shown in the Most Active Recently in Window monitor. The category of the testing sequence can then be set to the language label of the most active language model. i.e. If there were French, English and Spanish language models loaded from various training texts, and the most active language model shown by the monitor was the French language model, then the text can be categorized as being French.
The NetLogo model's Interface switches are defined as follows:
- setup-char-models?: If set to On, this sets up the models to recognize words and punctuation from characters.
- show-counts?: If set to On, this will show the counts in the visualisation of the language models.
- debug-trace?: If set to On, this will print out copious amounts of debug information to show the code being executed.
The NetLogo model's Interface input boxes, plot and monitor are defined as follows:
- pick-these-streams: This can be used to prune out what is visualised of each language model. Only the specified streams will be visualised.
- pick-these-events: This can be used to prune out what is visualised of each language model. Only the specified events will be visualised.
- test-text: This is the testing text that is being processed sequentially.
- Total of paths traversed plot: This plots the number of paths traversed in the language model versus the number of events processed.
- Most Active Recently in Window: This is the most active model in the past window (of length window-size) in the testing text.
THINGS TO NOTICE
Notice that word-based language models are significantly smaller in complexity when visualised than the complexity of the character language models. This is to be expected since there are a lot less word events in a text than character events (for example, there are on average for the English language 5 characters per word).
Try loading several language models all together. Then notice how when a testing text is chosen with language similar to one of the training texts used to train the language models, then the language model that becomes most active is usually the one with the similar language. This is because language can readily be identified uniquely, and explains why it is possible to do authorship ascription using language models with a high degree of accuracy.
THINGS TO TRY
Try loading all the language models from the different texts (specified by the which-text chooser). While doing this, set different values for the order of the language models by setting the max-tree-depth slider from 1 to 8. (Be careful when this value is set high - the language models can take some time to load, and the visualisation becomes extremely cluttered).
Try extending the maximum size of the training text (using the max-text-size slider) and see what affect this has on the complexity of the language models.
Try changing the window-size slider to see what effect this has on the Most Active Recently in Window monitor.
EXTENDING THE NETLOGO MODEL
Try a variation of the NetLogo model where most of the nodes are hidden, and only the ones chosen to be visualised (by the pick-these-streams and the pick-these-events choosers) are shown in the environment. This will mean much larger language models can be loaded (since the reset-layout procedure for the spring layout type is what slows down the program when a language model is loaded). An alternative animation could also be tried where only the active part of language model is animated. i.e. Everything in the language model remains hidden until it is activated.
If entropy and code length calculations are added (like in the Cars Guessing Game NetLogo model), then the model can be extended to perform with a high degree of accuracy the task of language identification, authorship identification or text categorisation simply by choosing the class (i.e. language, author or category) of the testing text to be the class of the text used to train the language model which produces the smallest code length on the testing sequence.
RELATED MODELS
See the Central Park Events and Wall Following Events NetLogo models. For prediction using these language models, see the Shannon Guessing Game NetLogo model.
CREDITS AND REFERENCES
This NetLogo model was created by William John Teahan.
To refer to this NetLogo model in publications, please use:
Language Modelling NetLogo model.
Teahan, W. J. (2010). Artificial Intelligence. Ventus Publishing Aps
PROCEDURES
; Language Modelling model.
;
; Creates some language models and uses event maps to store and visualise them.
;
; Copyright 2010 William John Teahan. All Rights Reserved.
;
breed [texts text]
breed [states state]
breed [walkers walker]
directed-link-breed [paths path]
states-own
[ depth ;; depth in the tree
stream ;; the name of the stream of sensory or motor events
event ;; the sensory or motor event
event-count ;; the number of times te event has occurred
]
walkers-own
[ path-length ;; length of path taken
location ;; the location of a state the walker is currently visiting
model ;; the model currently being traversed
stream ;; the stream associated with the node currently being traversed
event ;; the event associated with the node currently being traversed
]
texts-own
[ model ;; the model associated with the text
total ;; total of lengths of paths traversed
window ;; window of recent path lengths
wcount ;; sum of path lengths stored in the window
]
globals
[
most-active-count
most-active-model
]
to setup
clear-all ;; clear everything
set most-active-count 0
set most-active-model ""
set pick-these-streams ""
set pick-these-events ""
set-default-shape states "circle 2"
if (setup-char-models?)
[ setup-char-recognizers ] ;; setups the models to recognize words and punctuation from characters
reset-layout
;add-events [["aaa" 1] ["bbb" 2] ["ccc" 3]]
;add-events [["eee" 4] ["fff" 5]]
;add-events [["ggg" 6] ["hhh" 7] ["iii" 8] ["jjj" 9]]
;add-events [["aaa" 1] ["bbb" 2] ["xxx" 3]]
;add-events [["eee" 4] ["fff" 5] ["yyy" 6]]
;add-events [["ggg" 6] ["hhh" 7] ["iii" 8] ["zzz" 9]]
;add-events [["ggg" 6] ["hhh" 7] ["www" 8] ["zzz" 9]]
;add-events [["ch1" "N"] ["ch2" "o"] ["ch3" "w"]]
end
to setup-char-recognizers
foreach chars-in-string " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
[ add-events (list ["model" "word"] (list "char" ?)) magenta magenta magenta]
foreach chars-in-string "-'?!,;:\""
[ add-events (list ["model" "punctuation"] (list "char" ?)) magenta magenta magenta]
end
to start-new-walkers [this-stream]
;; creates new walkers at the start states
foreach sort states with [stream = "model"]
[
create-walkers 1
[
if (debug-trace?)
[ type "start-new-walkers: starting walker " print self ]
set color lime
set size 5
set path-length 0
set location ? ;; start at first state
set model [event] of ?
set stream this-stream
set event ""
;; set event-type backlink-event ? "type"
move-to location
]
]
end
to go-walkers
;; The walkers will walk over the states depending on what is in
;; the test-text.
let char-events []
let word-events []
let event-queue []
let this-stream ""
let this-event ""
if (count states = 0)
[ user-message "No states to walk!"
stop ]
set char-events chars-in-string test-text
;; set word-events words-in-string test-text
;; process the events in sequential order
;; foreach char-events [ update-walkers ? ]
;; foreach word-events [ update-walkers ? ]
foreach char-events
[
if (debug-trace?)
[ type "go-walkers: char event = " print ? ]
ifelse (event-queue = [])
[ set event-queue (list (list "char" ?)) ]
[ set event-queue (lput (list "char" ?) event-queue) ] ;; add onto end of the queue
if (debug-trace?)
[ type "go-walkers: event-queue = " print event-queue ]
;; user-message "Here"
foreach event-queue
[ ;; for each event on the event-queue, update all current walkers
update-walkers (first ?) (last ?)
]
plot-totals
set event-queue [] ;; reset the event queue to empty
ask walkers
[ set this-stream model ;; use the model as the stream for a new-walker
set this-event event
if (debug-trace?)
[ type "go-walkers: stream = " type model type " event = " print event ]
ask states with [stream = "model"]
[ if (count out-path-neighbors with [stream = this-stream and event = this-event] > 0)
[ set event-queue lput (list this-stream this-event) event-queue ]
]
]
]
end
to kill-all-walkers
;; kills alll the current walkers
;; (sometimes walkers can hang around from a previous execution)
ask walkers [ die ]
end
to update-walkers [this-stream this-event]
;; updates the walkers' positions by processing the stream's event
let new-location nobody
let out-paths-count 0
start-new-walkers this-stream ;; create new walkers at the start states
ask walkers
[
if (debug-trace?)
[ type "update-walkers: walker " type self type " model = " type model type " stream = " type stream type " event = " print event ]
ask location
[ set out-paths-count count out-path-neighbors with [stream = this-stream]
set new-location one-of out-path-neighbors with [stream = this-stream and event = this-event]]
if (new-location = nobody) and (stream = this-stream) and (event != this-event)
[ ;; we have a new event on the walker's stream but couldn't find anywhere to go to -
;; kill off this walker; restart a new walker if at end of path
if (debug-trace?)
[
type "update-walkers: killing walker " type self type " (out-paths-count = " type out-paths-count type ") "
type "stream = " type stream type " this-stream = " type this-stream
type " event = " type event type " this-event = " print this-event
]
update-total model path-length
if (out-paths-count != 0)
[ die ] ;; somewhere to go, but no relevant sensory input, so kill them off
;; at end of path with nowhere to go; restart at the beginning if there is
;; an out-path with (this-stream = this-event) on it
hatch-new-walkers1 model this-stream this-event
die ;; let the hatched walkers continue the work
]
if (new-location != nobody)
[
ask [link-with new-location] of location [ set thickness 2.0 ] ;; highlight link I am crossing
face new-location ;; not strictly necessary, but improves the visuals a bit
move-to new-location
set location new-location
set path-length path-length + 1
set event (word event [event] of location)
if (debug-trace?)
[ type "update-walkers: moving walker " type self type " new event = " print event ]
]
display
]
ask links [ set thickness 1.0 ] ;; reset all network links to uncrossed
end
to hatch-new-walkers [this-model this-stream this-event]
;; hatches new walkers at the start of a model
let new-start nobody
let out-paths-count 0
if (debug-trace?)
[ type "hatch-new-walkers: input - model = " type this-model type ", stream = " type this-stream type ", event = " print this-event ]
set new-start one-of states with [stream = "model" and event = this-model]
ask new-start
[ set out-paths-count count out-path-neighbors with [stream = this-stream and event = this-event]]
if (out-paths-count = 0)
[ type "hatch-new-walkers: cannot hatch walker with model = " type this-model type ", stream = " type this-stream type ", event = " print this-event ]
if (out-paths-count > 0)
[ ;; found an out-path with this-event on it
hatch-walkers 1
[ set path-length 0
set location new-start
set model this-model
set stream this-stream
set event this-event
if (debug-trace?)
[ type "hatch-new-walkers: hatching walker " type self type " with model = " type this-model type ", stream = " type this-stream type ", event = " print event ]
]
]
end
to hatch-new-walkers1 [this-model this-stream this-event]
;; hatches new walkers at the start of a model by concatenating events together
let new-start nobody
let out-paths-count 0
if (debug-trace?)
[ type "hatch-new-walkers1: input - model = " type this-model type ", stream = " type this-stream type ", event = " print this-event ]
set new-start one-of states with [stream = "model" and event = this-model]
ask new-start
[ set out-paths-count count out-path-neighbors with [stream = this-stream and event = this-event]]
if (out-paths-count = 0)
[ if (debug-trace?)
[ type "hatch-new-walkers1: cannot hatch walker with model = " type this-model type ", stream = " type this-stream type ", event = " print this-event ]
]
if (out-paths-count > 0)
[ ;; found an out-path with this-event on it
hatch-walkers 1
[ set location new-start
set event (word event this-event)
if (debug-trace?)
[ type "hatch-new-walkers1: hatching walker " type self type " with model = " type this-model type ", stream = " type stream type ", event = " print event ]
]
]
end
to-report update-window [this-window inc]
ifelse (length this-window < window-size)
[ report fput inc this-window ]
[ report fput inc but-last this-window ]
end
to-report most-active-recently
report most-active-model
end
to update-most-active [this-wcount this-model]
ifelse (this-wcount = most-active-count) and (most-active-count > 0)
[ set most-active-model (word most-active-model ", " this-model) ]
[ if this-wcount >= most-active-count
[ set most-active-count this-wcount
set most-active-model this-model ]]
end
to update-total [this-model this-path-length]
set most-active-count 0
ask texts with [model = this-model]
[ set total total + this-path-length
set window update-window window this-path-length
set wcount count-list window
update-most-active wcount model ]
end
to plot-totals
set-current-plot "Total of paths traversed"
ask texts
[
set-current-plot-pen model
plot total
]
end
to-report count-list [this-list]
;; reports the counts of the items on the list.
ifelse empty? this-list
[ report 0 ]
[ report reduce [?1 + ?2] this-list ]
end
to reset-layout
ifelse (layout-type = "spring")
[ repeat 500 [ layout-spring states paths 0.1 50 30 ]]
;else (layout-type = "radial")
[ ask states with [stream = "model"]
[ layout-radial states paths self ]]
;; leave space around the edges
ask states [ setxy 0.95 * xcor 0.95 * ycor ]
end
to change-layout
; ask links [ set thickness 1.0 ]
; ask agents [
; let new-location one-of [out-link-neighbors] of location
; ;; change the thickness of the link I will cross over
; ask [link-with new-location] of location [ set thickness 2.0 ]
; face new-location
; move-to new-location
; set location new-location
; ]
reset-layout
display
end
to-report matches-event-list [event-list]
;; returns true if the stream and event matches one on the event list.
if pick-these-streams = "" and pick-these-events = ""
[ report true ]
let this-stream ""
let this-event ""
foreach event-list
[
set this-stream first ?
set this-event last ?
; type this-stream type " = " print this-event
ifelse pick-these-streams = ""
[ ; match on events only
if position this-event pick-these-events != false
[ report true ]
]
[ ; match on streams plus possibly events as well
ifelse pick-these-events = ""
[ ; match on streams only
if position this-stream pick-these-streams != false
[ report true ]
]
[ ; match on both dimensions and events
if position this-stream pick-these-streams != false and
position this-event pick-these-events != false
[ report true ]
]
]
]
report false
end
to-report chars-in-string [string]
;; returns the characters in the string as a list.
let i 0
let chars []
repeat length string
[
set chars fput (item i string) chars
set i i + 1
]
report reverse chars
end
to-report alphanumeric [cc]
;; returns true if the character cc is alphanumeric
report (((cc >= "A") and (cc <= "Z")) or
((cc >= "a") and (cc <= "z")) or
((cc >= "0") and (cc <= "9")))
end
to-report words-in-string [string]
;; returns the words in the string as a list.
let cc ""
let words []
let this-word ""
let i 0 let p 0
let finished false
while [i < (length string) - 1]
[ ; go through each character in the line, one at a time,
; converting the string to a list of words
set p 0
set finished false
while [not finished]
[ ;; skip over non alphnumeric characters
set cc item (i + p) string
set p p + 1
if (i + p >= length string) or alphanumeric cc
[ set finished true ]
]
set i i + p - 1
set finished false
while [not finished]
[ ;; skip to end of alphanumeric sequence
set cc item (i + p) string
set p p + 1
if (i + p >= length string) or not alphanumeric cc
[ set finished true ]
]
set this-word (word substring string i (i + p - 1) " ")
set words fput this-word words
set i i + p
]
report reverse words ;; the words are in reverse order as each new word is added
;; at beginning of words to make it more efficient
end
to load-text-char-events [filename context-size this-model root-colour node-colour link-colour]
;; loads the text character events from the file.
;; context-size is the length of the window.
;; if pick-events list is non-empty, will list only those events found in the list.
let i 0 let j 0
let line "" let context ""
let event-list []
let item-list []
create-texts 1
[ set color black ; don't make this turtle visible on the screen
set model (word this-model " Chars")
set total 0
set window []
set wcount 0 ]
file-open filename
;; Read in all the data in the file
let c 0
let ch "null"
while [ not file-at-end? and (c < max-text-size - max-tree-depth + 1)]
[
set line file-read-line
;; print line
set i 0
while [i + context-size < length line and c < max-text-size - max-tree-depth + 1]
[ ;; print substring line i (i + context-size)
set event-list []
set j context-size
while [j > 0]
[ set ch item (i + j - 1) line
set item-list (list "char" ch)
ifelse (event-list = [])
[ set event-list (list item-list) ]
[ set event-list fput item-list event-list ]
set j j - 1
]
set event-list fput (list "model" (word this-model " Chars")) event-list
if matches-event-list event-list
[ add-events event-list root-colour node-colour link-colour ]
if (i mod 10 = 0)
[ reset-layout
display ]
set i i + 1
set c c + 1
]
]
file-close
reset-layout
display
end
to load-text-word-events [filename context-size this-model root-colour node-colour link-colour]
;; loads the text word events from the file.
;; context-size is the length of the window.
let i 0 let j 0
let line ""
let context ""
let words []
let event-list []
let item-list []
;; let item-list1 []
create-texts 1
[ set color black ; don't make this turtle visible on the screen
set model (word this-model " Words")
set total 0
set window []
set wcount 0 ]
file-open filename
;; Read in all the data in the file
let c 0
while [ not file-at-end? and (c < max-text-size - max-tree-depth + 1) ]
[
set line file-read-line
set words words-in-string line ;; returns the words in the line
;; now go through the list of words processing a context of context-size words at a time
set i 0
while [i + context-size <= length words and (c < max-text-size - max-tree-depth + 1)]
[
set event-list []
set j context-size
while [j > 0]
[ set item-list (list "word" item (i + j - 1) words)
ifelse (event-list = [])
[ set event-list (list item-list) ]
[ set event-list fput item-list event-list ]
set j j - 1
]
set event-list fput (list "model" (word this-model " Words")) event-list
if matches-event-list event-list
[ add-events event-list root-colour node-colour link-colour ]
if (i mod 10 = 0)
[ reset-layout
display ]
set i i + 1
set c c + 1
]
]
file-close
reset-layout
display
end
to load-text
let filename ""
let colour white
if (which-text = "English Bible")
[ set filename "Text_english_bible.txt" set colour white ]
if (which-text = "German Bible")
[ set filename "Text_german_bible.txt" set colour lime ]
if (which-text = "Shakespeare's As You Like It")
[ set filename "Text_shakespeare_as_you_like_it.txt" set colour red ]
if (which-text = "Jane Austen's Pride and Prejudice")
[ set filename "Text_austen_p_and_p.txt" set colour cyan ]
if (which-text = "Jane Austen's Sense and Sensibility")
[ set filename "Text_austen_s_and_s.txt" set colour sky ]
ifelse (text-event = "Chars")
[ load-text-char-events filename max-tree-depth which-text colour colour colour ]
[ load-text-word-events filename max-tree-depth which-text colour colour colour ]
end
to remove-model
;; removes the user specified parts of the model
let this-stream user-input "Which stream?"
let this-event user-input "Which event?"
ask states with [(stream = this-stream or this-stream = "") and
(event = this-event or this-event = "")]
[ remove-state self ]
end
to remove-state [this-state]
;; removes the state and any of its children
ask this-state
[ ask out-path-neighbors
[ remove-state self ]
die
]
end
to set-state-label
;; sets the label for the state
ifelse (show-counts?)
[ set label (word stream " = \"" event "\" [" event-count "] ") ]
[ set label (word stream " = \"" event "\" ") ]
end
to add-events [list-of-events root-colour node-colour link-colour]
;; add events in the list-of-events list to the events tree.
;; each item of the list-of-events list must consist of a two itemed list.
;; e.g. [[hue 0.9] [brightness 0.8]]
let this-depth 0
let this-stream ""
let this-event ""
let this-state nobody
let next-state nobody
let these-states states
let matching-states []
let matched-all-so-far true
foreach list-of-events
[ set this-stream first ?
set this-event last ?
;; check to see if state already exists
set matching-states these-states with [stream = this-stream and event = this-event and depth = this-depth]
ifelse (matched-all-so-far = true) and (count matching-states > 0)
[
set next-state one-of matching-states
ask next-state
[ set event-count event-count + 1
set-state-label
]
set these-states [out-path-neighbors] of next-state ]
[ ;; state does not exist - create it
set matched-all-so-far false
create-states 1
[
set depth this-depth
set stream this-stream
set event this-event
set event-count 1
set-state-label
ifelse (depth = 0)
[ set label-color root-colour ]
[ set label-color node-colour ]
set size 6
ifelse (depth = 0)
[ set color root-colour ]
[ set color node-colour ]
set next-state self
]
]
if (this-state != nobody)
[ ask this-state [ create-path-to next-state [ set color link-colour ]]]
;; go down the tree
set this-state next-state
set this-depth this-depth + 1
]
ask links [ set thickness 1.0 ]
;; traverse-backlinks next-state
end
to traverse-backlinks [this-state]
;; traverse back up the backlinks for this-state
let backlinks nobody
let ok true
while [ok = true and this-state != nobody]
[
print this-state
ask this-state [ set backlinks in-path-neighbors ]
ifelse (backlinks = nobody)
[ set ok false ]
[ set this-state one-of backlinks ]
]
end
to-report backlink-event [this-state this-stream]
;; reports the event associated with this-stream in any of the backlinks for this-state
let this-event ""
ask this-state
[
ask in-path-neighbors with [stream = this-stream]
[
set this-event event
]
]
report this-event
end
;
; Copyright 2010 by William John Teahan. All rights reserved.
;
; Permission to use, modify or redistribute this model is hereby granted,
; provided that both of the following requirements are followed:
; a) this copyright notice is included.
; b) this model will not be redistributed for profit without permission
; from William John Teahan.
; Contact William John Teahan for appropriate licenses for redistribution for
; profit.
;
; To refer to this model in publications, please use:
;
; Teahan, W. J. (2010). Language Modelling NetLogo model.
; Artificial Intelligence. Ventus Publishing Aps.
;
