Designing An NLP System
AI-C Project Phases
Contact me regarding AI-C
Brains vs. AI
"Programs" in the brain.
Should AI work like our brains?
Advantages of the computer over the brain
Perfect memory of virtually infinite duration
Unlimited memory capacity
Capacity for work
More efficient "programs"
The Cortex table
The Links List Box
The Links Table Window
Creating new Cortex entries
Changing an entry
Deleting an entry
Numeric and date entries
External Reference menu
Reviews of NLP-Related Books
AI/NLP Related Web Links and Books
This document explains the background and design of the AI-C database and serves as the documentation for the AI-C Lookup program, including an analysis of the program's Visual Basic 6 code. It also helps keep me on track, which is its main purpose; I refer back here often to refresh my memory on how and why things are done in the AI-C Lookup program. Finally, the last part of this document is various information, thoughts, and other notes about AI/NLP which I keep here for easier reference.
Natural Language Processing (NLP) is the foundation of general Artificial Intelligence (AI). For purposes of this document and project, NLP-AI (or usually just AI) refers to the ability of a computer program to at least match the human brain's ability to reason and communicate. This is sometimes referred to as Natural Language Understanding rather than just Natural Language Processing, but when I use NLP in this document, I mean it to refer to understanding text, not just processing it.
The ultimate goal of a general AI would include all the senses (sight, hearing, even touch, smell and taste), but that goes beyond the current scope of this project.
Due to IBM's Watson beating the top players on TV's Jeopardy, some people think that NLP is off and running, but what Watson was doing had little in common with natural (conversational) language. It was not maintaining a dialogue; it was translating a relatively standard form of input into key words which allowed it to look up data in its massive database of mostly trivia and respond in a standard way.
AI-C is a long, long way from completion, but some of the already-existing data may be of use to others who are working on their own programs, even for those who may not agree with my overall approach. In that regard, even I sometimes don't agree with some past approach I've taken and as a result, change the way I do things.
The current version of the program also contains many tools for working with words, as described herein.
The AI-C Lookup program is just a tool for working on the AI-C database. The program is NOT the point nor purpose of AI-C, although the routines in the program should eventually be helpful in writing code for using AI-C as an NLP-AI program.
No copyrights are attached to anything by me on the AeyeC web site. Any of this can be adapted in any way you wish. If anyone wants to adopt the AI-C database and approach, that is great too. Either way, I am more than happy to cooperate with others on AI/NLP.
There is NOT just one way of doing anything in AI-C. I have found myself constantly wanting to add this disclaimer to this documentation and to the program code, but I have instead displayed it prominently here and hope that serves for the whole docs/code.
The structure of the databases lends itself to an almost endless variety of approaches and techniques. This document and the source code show how I am doing things, so naturally these are the methods I think are best, but the databases are flexible enough to support many others. I've found that the most difficult and time-consuming aspect of NLP is getting words indexed into the databases. Once they are in, it is easy to change the way they are indexed or anything else about the structure of the databases.
Most of the NLP projects I've seen over the last 20+ years have involved trying to acquire real-world knowledge (often called "common sense data") by getting people to enter facts either as sentences or into forms.
A more recent project by the name of NELL acquires data by reading pages on the Internet. NELL appears to be focused on gathering facts about known entities, such as people and places, so its efforts are only marginally applicable to working on full AI/NLP.
However the data is acquired, every project I've seen stores data as words and sentences (usually definitions).
More recently, much attention has been given to trying to figure out exactly how the brain functions as a model for creating a brain-like AI.
Relatively little is known about exactly how the brain works to create "intelligence", but one thing that is known is that knowledge is not stored in the brain in the form of sentences. Instead, the visual images of words are stored in one place and the sounds of words in another, but knowledge is represented in the brain through the trillions of linking synapses.
Likewise, rather than storing sentences, it makes more sense to design an AI which stores the relatively few words separately from the (eventually) trillions of links among those words, which are what would constitute the AI. This is the approach taken with AI-C.
The design idea behind AI-C is to have a Cortex (table) which has no text in it. Words in AI-C are stored in a Words table and linked into the Cortex via each word's ID number in the Words table. (There are other types of tables with text in them, such as Pronunciation, and they work the same way.)
Entries in the Cortex table linking to text and numbers in other tables will make up only a very small percentage of the Cortex; the vast majority of entries will be linking entries within the Cortex to each other.
Note that entry #5 does not point directly to any words. It points to other entries which (eventually) point to words.
This allows you to reuse the entry for big - red with other words or linked sets of words, such as:
#7: WordID#5 (engine - noun)
#8: 6 (fire) - 7 (engine)
#9: 5 (big red) - 8 (fire engine)
#10: WordID#6 (fireman - noun)
#11: WordID#7 (rides - verb)
#12: 10 (fireman) - 11 (rides)
#13: 12 (fireman rides) - 9 (big red fire engine)
As you can see, the last entry combines 6 words with only two pointers. In addition, this approach allows you to search for any combination of these words, such as finding all things which are big-and-red, as in the sketch below.
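For illustration, here is a minimal, self-contained sketch of the idea in Python and SQLite rather than AI-C's actual Access/VB6 code. The table layout, entry numbers, and the negative-StartID convention for marking word links are invented for the sketch (AI-C uses LinkTypes for that):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Words  (WordID INTEGER PRIMARY KEY, Text TEXT);
    CREATE TABLE Cortex (ID INTEGER PRIMARY KEY, StartID INTEGER, NextID INTEGER);
""")
db.executemany("INSERT INTO Words VALUES (?,?)",
               [(1, "big"), (2, "red"), (3, "fire"), (4, "engine"), (5, "ball")])

# Entries mirroring the example above; a negative StartID marks a link
# into the Words table, purely as a convention for this sketch.
db.executemany("INSERT INTO Cortex VALUES (?,?,?)",
               [(101, -1, 0), (102, -2, 0),   # big, red
                (103, 101, 102),              # big - red
                (104, -3, 0), (105, -4, 0),   # fire, engine
                (106, 104, 105),              # fire engine
                (107, 103, 106),              # big red fire engine
                (108, -5, 0),                 # ball
                (109, 103, 108)])             # big red ball

# Find everything that reuses the "big - red" entry (#103): a single
# indexed query on numeric pointers, with no scanning of sentence text.
for (entry_id,) in db.execute("SELECT ID FROM Cortex WHERE StartID = 103"):
    print(entry_id)                           # -> 107 and 109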
Compare the method above to a knowledge base made up of "facts" in sentences, such as: "The child plays with a big red ball."
Because the individual words are not indexed, the only way to find "things which are big and red" is to look through each entry in the database. Even if an AI numerically links main words to sentences, such as linking child, plays, big, red, ball to the sentence above, it takes twice the space - once to store each word as part of the sentence and again to store the link to each word.
For an example of a variety of linked entries, look up aardvark in the AI-C Lookup program. One entry, #126141: aardvark's excellent (sense of) hearing and good (sense of) smell locate prey links 20 different Cortex entries. Each of these words can be searched for individually or using any combination of the words, and such a search would find not only entry 126141 but all other entries matching the search criteria (or in other words, finding other things with similar characteristics).
Another problem with storing sentences in a database, or even with linking words together rather than linking numeric pointers, is that you cannot indicate tense, part of speech, or the specific meaning of a word. For example, is "read" present tense or past tense in "read a book"? How about "wishes for spring" -- is "wishes" a verb or a noun? And is "spring" the season or something like "spring in his step"?
In AI-C's Cortex, since any entry or series of entries can be linked to another with a single ID#, tense, POS, meaning, etc., can be indicated with an entry such as 9001 <for> 9002, where 9001 would be "wishes [noun]" and 9002 would be "spring <type of> season".
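To make the example concrete, here is an illustrative (purely hypothetical) rendering of those entries; in the real Cortex, each gloss below would itself be a chain of numeric links into the Words and LinkTypes tables:

# Hypothetical stand-ins for Cortex entries 9001-9003 described above.
cortex = {
    9001: "wishes [noun]",            # the noun sense, not the verb
    9002: "spring <type of> season",  # the season, not the coil or the leap
    9003: (9001, "<for>", 9002),      # 9001 <for> 9002 -- fully disambiguated
}

A program holding ID# 9003 knows the tense, POS, and sense of every word involved without ever re-parsing the phrase.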
I have been working on this project in one form or another since the late 1980's, starting over several times. One time I actually completed the addition of my list of 120,000+ words to a database, but I decided that the approach (using syllables instead of words as the basic unit of text) was not right, so I started over from scratch using words as the basic unit of text storage. That was the point where I started the first phase of the current design. Obviously, I'm not pressured by any kind of timetable nor profit motive. I just do this for the challenge and the fun of it.
The first phase was creating the vocabulary database described below where words are linked to parts of speech, to pronunciations and to their syllabifications. On Oct.1, 2009, this phase was completed to the extent of the word lists available to me at the time. Of course, the addition of new words will be an unending process.
As of Oct. 1, 2009, the Words table had over 136,000 entries in it, although a significant percentage of these are forms of root words; that is, the table includes multiple forms of verbs, adjectives, etc. For example, while the verb "to play" may have one basic meaning, its forms in the Words table include "play, played, playing, plays". And of course you also have "play, plays" - noun and noun plural - and "play" as an adjective (e.g.: "play money").
The second phase started with parsing definitions from dictionaries and linking the results in the Cortex. When all the definitions are done, the plan at this time for the next phase is to parse entries from Wikipedia and add that information to the database via linking. While a dictionary provides a very brief definition of a word, an encyclopedia gets into much greater detail, which is what we want in AI-C.
Entering definitions is slow going, so for now I am skipping words which are rarely seen or heard in normal use. These can be entered at a later date, but I hope to be able to get enough other words entered to start using the database for NLP.
If anyone has a strong interest in NLP and is a fairly accomplished wordsmith and is looking for a time-consuming hobby, this is it. As of March 2013, I'm still working my way through the a's. I just spent two days entering the definitions and related links for air (noun and verb). It would be easy enough to split up the work -- just pick a letter you would like to start with. Again, to be clear, this is not a paid position, it is a hobby.
The third phase, having created a solid foundation of language data, would be to have AI-C read text from various sources (mainly the Internet) and integrate the data into its Cortex (and related tables).
The next phases would be to put AI-C to whatever use is possible after having completed the other phases. I am 66 (as of Nov. 2012) and would be pleasantly surprised to live long enough to see this phase (or even the third phase) in gear.
The files linked below are not there (other than the VB6 runtimes) because I work on this project every day (at times) and it is too hard to keep everything cleaned up, archived, and uploaded on a regular basis.
Contact me (see next section) and I will upload the latest files. When asking for the files, it would be appreciated if you would provide some information about your interest, what you plan on doing with the files, and how you learned about this web site.
AI-C Database in Access 2007 format.
AI-C Lookup program VB6 source code and executable.
If you do not have VB6, you may need to install the VB6 runtime modules to get the program to run.
This file you are reading serves as the documentation. It is included with the files above.
FordSoft File Viewer
The reason for listing it here is that it contains the AI-C spelling corrector.
The zip file includes the VB6 source code and executable, though to run it, you may need the VB6 runtime modules (see above).
Spelling Suggestion code
My name is Nelson Ford. You can contact me via the email form linked below.
In the brain, the Cortex (or neo-Cortex) is a 6-layer sheathing that encases the rest of the brain and is responsible for higher level brain functions. It is essentially an enormous collection of memory locations (sort of) with a massive amount of parallel interconnection between them. From this comes everything that we are and that we know and think -- our non-artificial intelligence.
Brainiacs seem to agree that the Cortex is in the business of searching for patterns in its memory network with which to respond to input, and creating new patterns when appropriate ones are not found. When patterns are found, they are used to predict what is coming. This is what allows us humans to react to things in a timely manner.
However, nobody knows exactly how all of this works, which is what makes trying to create Artificial Intelligence which mimics it so difficult. But to beat on what should already be a dead horse: it is known for certain that the spellings, sounds, and sights of words are each stored in their own areas of the brain, while combinations of concepts are linked together separately and link back to the words themselves only when needed for external communication. This is how AI-C is designed, and to my knowledge, no other NLP database is.
Even if we could create a huge database which resembles the brain's Cortex, it would not do anything unless we wrote a program to drive it. The program driving our Cortex is in our DNA. (This is not to be confused with "programs", such as for playing chess, which must somehow be stored in our brains as a very complex set of interconnected memory locations.)
This DNA Brain program is not trivial. Scientists believe that the compressed version of the code (removing redundancy) is 30MB-100MB. This seems extremely large, given that the code only governs the framework for how the brain stores and processes data; that is, it does not contain programs for performing specific real-world tasks (such as playing chess).
When I started this project, I certainly didn't anticipate having 30MB-100MB of compressed code with which to manage the Cortex data. I was under the impression that essentially all of the work of the Cortex was done in the Cortex itself.
Like the brain, AI-C is made up of two parts -- the database which contains all the knowledge and the programs which work with the knowledge, but we have the advantage of being able to write programs for specific, complex tasks in stand-alone apps rather than by wiring together data points the way the brain does. Again I refer to the example of a chess program.
No. The brain, for all its wondrousness, has also been the generator of unspeakable horrors. The brain is the seat of insanity, degeneracy, irresponsibility, self-destructiveness, and on and on. It would be folly to manufacture an AI which works exactly as does the brain and not expect the same types of problems.
Also, the brain is simply wired in a way which makes some methods essential for the brain which would not make sense for a computer. For example, in this research (paragraph 9), the scientists found that words that sound alike are linked in the brain. It would be a waste of space and effort to link them in a database because we can quickly and easily look up sound-alike words (in the Pronunciation table) at any time - something which the brain cannot quickly and easily do.
And yet, a lot of attention is currently (Dec.2009) being given to "reverse engineering" the brain for AI, including IBM's supposed emulation of a cat's brain.
What really matters is to meet and exceed the output of the brain without the deficiencies of the brain. An analogy would be the manufacture of synthetic lumber for building decks, among other things. The goal is not to manufacture boards which are identical in all respects to lumber from trees, but to create artificial boards which serve the same purpose in deck building while not rotting, splintering, warping, needing to be resealed regularly, etc., the way "natural" boards do.
The computer has numerous and very large advantages, so if it can be made to achieve the same (desirable) type of output as the brain, but with the advantages of the computer, it will be vastly better than the brain.
The brain receives data input (externally or internally), looks for patterns which relate to the input, analyzes those patterns to try to predict what's coming next, stores and retrieves data, and generates output. These are all tasks at which the computer can excel.
The brain absorbs spoken input at about the same speed as it absorbs printed input, whereas a computer can absorb printed input (by "printed", I mean electronic text in ASCII format) many, MANY times faster than spoken input.
A single communication within a computer is millions of times faster than within the brain. And although the brain has the advantage of its communication being massively parallel, we may not be far from achieving that in computers.
When it comes to storing data, the basic unit of text storage which AI-C must handle is the word, or in special cases, a word fragment. Of course, the computer software below the database level does keep track of individual characters and somewhere in the computer hardware, code is even required to draw a character on the screen, but that is not something that the AI normally must worry about.
The brain's cortex does not have the ability to store a word as a single unit. It is not even able to store individual letters. Neural coding is not completely understood, much less how the cortex stores information which it assembles into letters and words.
The fact that the cortex stores small elements of letters is what allows the brain to recognize a letter in any font in which the elements are recognizable. Web sites use this ability as a security measure, distorting text so that computers cannot make out the letters, but the brain can, so a human user can enter the text to pass the security test.
So it is possible that the brain's basic unit of word storage (for the printed form of a word) is the linked-together general features of each letter making up the word (though not necessarily the correct set of letters, when dealing with a word which was only heard and not seen spelled out). On the other hand, because the brain can store data using parallel connections, it does not have to recall letters (or letter fragments) one unit at a time. Instead, it probably has words stored like this:
 |   |   |   |   |   |   |   |   |
(i) (m) (p) (r) (o) (v) (i) (s) (e)
In addition to the printed form of a word, the cortex links together neurons in a way that stores the auditory form of words. In fact, as children, this is obviously what we learn and the brain stores before we learn to read.
While parallel linking to each letter in a word may seem to be the equivalent of a single link to the whole word (as we do in AI-C), the problem with the former is that the brain's links for any of the individual letters may weaken and some or all of what makes the word may be forgotten. That brings up the next point:
Perfect memory of virtually infinite duration
The brain's ability to remember data is dependent upon the number of times its related links are reinforced by repeating the data (and possibly by brain functions which take place while we sleep), and those links will still weaken over time if not reinforced in the future (although some evidence suggests that lifelong memories may be linked differently). In extreme contrast, if a link is created a single time to a piece of data in the computer, AI-C will remember it forever.
This leads to a significant difference between the best way to store data in the brain and the best way in the computer. The brain must get around the fact that it will gradually lose data whose links are not sufficiently reinforced and/or when it must make room for new data. For example, most scientists believe that rather than store every possible verb form, the cortex uses "rules" to create regular verb forms and stores only the irregular forms. Likewise, rather than storing every "pixel" in a visual image, it stores only key elements of images. One reason it does these things is that the more it must remember, the more it will eventually forget. Another issue is that the brain has a limited amount of data storage space, as discussed below.
In contrast, the computer never forgets anything, so it can store ALL verb forms and only has to use rules once, to create the forms of a newly encountered verb before storing them. The advantage is that it is much faster to recall stored verb forms than to recalculate them each time, particularly since the words are indexed for immediate retrieval the first time they are stored. Recalling stored data is also more reliable than recomputing verb forms, reassembling whole pictures from parts, etc.
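As a minimal sketch of that trade-off (not AI-C's actual code, and ignoring irregularities such as consonant doubling), rules like these would be run once for a newly encountered verb, after which the stored, indexed forms are simply recalled:

# Generate regular forms for a new verb before storing them permanently.
def regular_verb_forms(verb: str) -> dict:
    if verb.endswith("e"):                  # bake -> baked, baking
        past, ing = verb + "d", verb[:-1] + "ing"
    else:                                   # play -> played, playing
        past, ing = verb + "ed", verb + "ing"
    third = verb + "es" if verb.endswith(("s", "x", "z", "ch", "sh")) else verb + "s"
    return {"past": past, "participle": ing, "third_singular": third}

print(regular_verb_forms("play"))
# {'past': 'played', 'participle': 'playing', 'third_singular': 'plays'}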
Computer memory can be backed up, both locally and in distant locations away from local disasters, and restored if the computer's memory is damaged and/or must be replaced. In contrast, data in the brain can be lost forever due to brain damage either from outside trauma or internal disease, strokes, tumors, aneurisms, etc.
Unlimited memory capacity
The ability of the brain to forget data (due to weakening links) can be viewed as a necessity for a brain with limited (albeit huge) room to store and recall data -- it cannot keep adding data to the cortex indefinitely. (See http://www.physorg.com/news185720165.html.)
The computer has no such limitation, so it has no great need to let data fade in order to make room for new data. Still, it is easy enough to mimic the strengthening and weakening of links in AI-C, if need be, by linking to usage frequency counters within AI-C.
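A sketch of that idea, assuming a hypothetical per-entry usage counter standing in for AI-C's Freq-style field:

usage_freq = {}   # CortexID -> usage count

def touch(entry_id: int) -> None:
    # "Strengthen" a link by counting each use of a Cortex entry.
    usage_freq[entry_id] = usage_freq.get(entry_id, 0) + 1

def rank(candidate_ids: list) -> list:
    # Prefer the most-used entries, the way reinforced links win out in the brain.
    return sorted(candidate_ids, key=lambda e: usage_freq.get(e, 0), reverse=True)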
Capacity for work
The brain must be put into a sleep state for about 6-8 hours a day. Less sleep can cause reduced brain performance. Total sleep deprivation can cause even more extreme mental problems and even death. We don't really know why this is or exactly what the brain is doing during this down time. Current theory is that functions take place which strengthen new data links formed during the day.
Even with the 8 hours of sleep, the brain cannot work on problems non-stop the other 16 hours a day, yet a computer's AI can run 24/7, non-stop, on a single problem if necessary. With multiple processors, it can work on many problems, as well as routine chores, at the same time, and do so continually at its highest level of performance, never getting tired, sick, or distracted.
More efficient "programs"
As previously mentioned, the greatest advantage of the computer may be that we (and in the future, AI-C) have the ability to write programs to accomplish tasks which our brains can only mimic by wiring together data points (similar to circa 1950's machine coding before the first programming language was invented).
While the overall functioning of the brain is controlled by our DNA, a "program" in our brains to, say, play chess, is not a program but an extremely complex set of interconnected neurons. It boggles my mind to even think of such an approach. In contrast, a program for a computer to play chess can be written for that specific purpose. At one time, a disadvantage of computer game programs was that they did not allow the computer to learn by experience and adapt, but we are seeing more adaptive game programs being written these days, though even without that ability, IBM was able to develop a program which could beat a top international chess grandmaster.
Before I started working on AI-C, I was working on a bridge-playing program which stored all of the data for its decision making in data files rather than being hard-coded into a program. A major advantage of this approach is that the program can modify its own decision-making parameters. I finished the bidding part before deciding that AI-C was a more beneficial use of my time.
In the 1980's, I wrote CardShark Hearts and CardShark Spades, but they used hard-coded algorithms. I had intended to make their algorithms self-modifiable, but they began to win at a high enough rate without doing so. And after all, nobody really wants to play against a program they can never beat.
Some people may argue that humans can also run specialized programs, use calculators, etc., but that is not at all comparable. For the computer, specialized programs are, for all practical purposes, part of the computer's "brain" in that they are instantly accessible and can be directly integrated into the AI's "thought processes".
The current general design of AI-C's database was created with Microsoft Access97 and later updated to Access 2007; it is no longer readable by Access97. (Even earlier versions were written with Btrieve.) Any further references to Access refer to the 2007 version. AI-C can be viewed and edited with Access, although using software written specifically for viewing and editing the AI-C data, such as the AI-C Lookup program documented below, is much more efficient.
I am presently converting the database to SQLite for use on portable Android devices.
If needed, the data could be provided in comma-delimited text files which could easily be imported into any database, including older versions of Access. Either way, others should be able to use their choice of programming languages with the data.
The maximum size of an Access 2007 database is 2GB. As this is being written, the Cortex has about 120,000 entries and the total size of the database is about 80MB. Other tables make up some of that 80MB, but even so, it seems likely that the database will hold only about 2,500,000 entries before it hits 2GB.
Dec. 2010 note: the Cortex has only gone up by about 8,000 entries, but the database has increased in size to over 1.1GB. When I noticed the database getting big, I split the Words table, the Syllables table, the Pronunciation table, and a bunch of other tables off to a new database called AIC Words, which is 88.6MB, leaving just the Cortex, LinkTypes, Numbers, and Shapes tables in AIC Cortex, which is now only 14MB. The Words tables (those with text in them) should not get a whole lot larger (although they could if other languages are added to them). Most important at this point is that 128,000 Cortex entries take only 14MB, meaning that about 18.3 million Cortex entries could fit in one 2GB database.
Another limit to be faced some day is that the fields in the Cortex are all long integers, which can only hold numbers up to just over 2 billion, so even if the 2GB database limit problem is solved somehow, at some point the fields would need to be changed to double precision, and the database would then double in size, allowing only half as many entries in the same size data file.
IBM's cat brain emulator took over 147 thousand CPUs and 144 TB of main memory, which for the human brain could translate to over 3000 TB. So 2GB looks pretty skimpy for the long run, but it gives me a lot of room to work for now and I'll worry about what to do next when I run out of that space. It is highly unlikely that I could ever make 2.5 million entries manually, so the 2GB limit is safe at least until I can automate the importing of text into linked entries in the database.
IBM's Watson program (for playing "Jeopardy") has at its beck and call ninety IBM POWER 750 servers, 16 Terabytes of memory, and 4 Terabytes of clustered storage, enclosed in ten racks including the servers, networking, shared disk system, and cluster controllers. Each of these ninety POWER 750 servers has four POWER7 processors, each with eight cores, giving Watson a total of 2,880 POWER7 cores. The POWER 750's scalable design is capable of filling execution pipelines with instructions and data, keeping all the POWER7 processor cores busy. At 3.55 GHz, each of Watson's POWER7 processors has an on-chip bandwidth of 500 Gigabytes per second, and the total on-chip bandwidth for Watson's 360 POWER7 processors is an astounding 180,000 Gigabytes per second.
Not your daddy's IBM-PC, eh?
Finally, even if I am unable to crack the 2GB barrier, that should be enough room for me to complete enough of a Cortex database to be able to determine if the approach I'm using will work as an NLP-AI. If it does work and nobody has come up with anything better by then, perhaps someone else will be able to take it beyond 2GB.
The Cortex table's data structure was designed to be as simple (and thus as flexible) as possible. A record consists of five numeric fields (long integers, at this time), including the entry's own ID#, a StartID#, a LinkType, and a NextID#. Text is stored in other tables, such as Words, Syllables, and Pronunciation, which are discussed in more detail further down. Numbers and dates can be stored in the Numbers table.
The StartID# and NextID# are usually pointers to a pair of entries which are related in some way, and each (or neither) of which may, in turn, point to other related entries.
The LinkType normally indicates the type of relationship between the two entries, and in some cases, the data table in which the source of the link resides or the type of data being entered, etc. However, LinkTypes are meaningless unless software has been written to recognize the LinkTypes and take appropriate action, hence all the usually's and normally's above. This will be discussed in more detail later.
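A minimal sketch of such a record in SQLite terms, using only the field names mentioned in the prose above; the LinkType code and entry IDs are invented for illustration:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE Cortex (
        CortexID INTEGER PRIMARY KEY,  -- the entry's own ID#
        StartID  INTEGER,              -- first entry of the linked pair
        LinkType INTEGER,              -- the relationship between the pair
        NextID   INTEGER               -- second entry of the linked pair
    )
""")

FOR_LINK = 77   # hypothetical code for a "<for>" relationship
# 9001 ("wishes [noun]") <for> 9002 ("spring <type of> season"):
db.execute("INSERT INTO Cortex VALUES (?,?,?,?)", (9003, 9001, FOR_LINK, 9002))
# Software that does not recognize LinkType 77 can do nothing useful with
# this row -- hence all the usually's and normally's.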
In March 2010, I added a Date Entered field. I didn't list it above because it is an automatic field and not essential to the Cortex. It may be handy to be able to see when entries were made, but because database space may become an issue at some point, it is a field that could be dropped in the future or moved into its own database.
In August 2010, I added a Freq field. Again, I have not shown it above because it is not an essential field. It is designed to aid in selecting the most likely correction when a word is entered which is not in the database. Because the usage frequency is based on the part of speech of each word, it will also help, when parsing input, in deciding which part of speech a word is most likely being used as.
See Spell Corrector under Using the AI-C Lookup Program: Look-Up buttons for more information.
Each word in the Words table has a unique ID#. For a word to be usable in the Cortex data table, its WordID# must be entered along with its Part Of Speech (POS) in the LinkType field. Many, if not most, words have multiple parts of speech. An entry must be made in the Cortex for each WordID-POS combination, such as mail - noun and mail - verb.
When entering a different form of a verb, adjective, adverb, or noun plural, the CortexID# of the root word is entered in the StartID field. (It may be useful to link a different POS as the root of a word, such as prince, noun, being the root of princely, adjective, and princely, adverb. I didn't think of this until Phase 1 was completed, so only a few such entries made after Phase 1 have links to roots with different POS's.)
When adding a WordID to the Cortex, syllabification and pronunciation entries are made in their respective tables.
The Cortex does not use a word without a POS. A word cannot be defined or used without knowing its POS, so there is no point in adding it to the Cortex without the POS, though I use POS loosely because it could be an abbreviation, a prefix, etc. (See the LinksType table.)
The Words table contains my initial, cleaned-up list of words from different sources, but not all words in the Words table were added to the Cortex. In some cases, it is because a word was a misspelling or not found in dictionaries or is not grammatical, such as a plural form for a mass/uncountable noun or, similarly, a comparative form for an adjective which is not comparable. But there could also be some perfectly good words which just didn't get added for whatever reason.
How to enter words into the Cortex will be discussed in depth under LinkType table, below.
Tables in AI-C have a Tag field which can be used by software to mark entries for later review, normally by a human. For example, the subroutine FindDiffPrns finds different pronunciations of the same word and uses the Tag field to mark such entries.
Another maintenance field in the Pronunciation table is Ver which is used to indicate that a pronunciation has been verified. Computed pronunciations can be wrong because rules are not guaranteed to apply to all words, so knowing that a pronunciation has been verified is helpful in choosing between two pronunciations, computing new pronunciations, etc.
You can delete those fields (or add others) if you wish, because...
There is NOT just one way of doing things in the Cortex. Normally, the Cortex does not use a word without a POS, but that's just the way I am doing it now. A programmer could write a routine using particular LinkTypes which link to text without POS's. Likewise, there may be many different ways to link related Cortex entries together. When linking two entries, many times it does not matter significantly which entry is in the Start and which is in the Next. And in the long run, the AI-C itself will likely reorganize links for optimum efficiency anyway.
AI-C has what is probably the simplest possible database design (just five essential fields in the main table) and is available in the simplest possible formats (Access2007 or straight ASCII) with which you (assuming you program) can do anything you want using any programming language you want. Even building off the existing database and software, it should easily be possible to add fields to tables, add tables to the database, and even add new databases to the project, then incorporate them into AI-C by adding new LinkTypes which let you write code for dealing with them.
Each record in the Words table consists of an entry ID#, a text field which can be up to 50 characters, and a Soundex field (and currently, a word frequency field). Since the Words table was designed to hold only single words (i.e.: not phrases), it seemed like 50 characters should be plenty, but it can easily be changed.
There are no duplicate entries in the Text field, although capitalization counts so that Ford (brand of a car) and ford (crossing a river) are NOT considered duplicates.
However, I cannot set the index of text in the Words table to "no duplicates", because with that setting, Access ignores case and would not allow the two entries. I tried following Access' instructions for making the database software ignore case, but their suggestions did not work for me. Therefore, before adding a word to the Words table, it is necessary to check to make sure it is not already there, since the database is set to allow duplicates even though we don't want them.
Similarly, if you check to see if a word is already in the Words table and the database engine says it is, check to make sure that the capitalization is the same. That is, if you search for "ford" and the Words table finds "Ford", it will tell you that a match has been found. If the case does not match, continue searching to see if it finds the word with matching case.
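A minimal sketch of that double-check (a pure-Python stand-in, not the actual VB6/Access routine):

def already_in_words_table(new_word: str, words: list) -> bool:
    # A case-insensitive index would "find" Ford when searching for ford...
    candidates = [w for w in words if w.lower() == new_word.lower()]
    # ...so keep checking until the capitalization matches exactly.
    return any(w == new_word for w in candidates)

print(already_in_words_table("ford", ["Ford", "fjord"]))   # False: safe to add
print(already_in_words_table("Ford", ["Ford", "fjord"]))   # True: a duplicate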
Many years ago I was writing an HTML editor (which, in fact, I am using to write this document) and wanted to add spell-checking. To do this, I searched the Web for word lists. I found a bunch, but they were all pretty junky. I compiled them, cleaned them up, ran them through other spell checkers, and ended up with a list of about 100,000 words.
When I (re)started this NLP project, I began with that list for the Words table, adding to it when needed and creating entries in the Syllables, Pronunciation, and Cortex POS tables. Not every word in the Words table was linked into the Cortex. Words (in the Words table) without such entries should be considered suspect -- either misspellings or made up; however, they could be obscure, but legitimate words. Unused words were left in the table because they don't take up much room.
Common misspellings can be included in the database and linked to the proper spellings with a LinkType of misspelling. A spelling corrector added to the AI-C Lookup program in mid-2010 has proven to be so accurate that it almost always can find the intended word from any normal types of misspellings, so the entries for misspellings are probably not necessary.
In the late 1980's, I wrote an English-Spanish dictionary by the name of Ventanas: Spanish For Windows. (I guess that was my attempt at being clever since for you non-Spanish speakers, the word ventanas is literally the Spanish word for windows. And at the time the program was written, many Windows programs were identified as being for Windows since Windows was still relatively new.)
As of March 22, 2010, I had not looked at it for a long time, so I tried running it under Windows 7 and the program still runs (in Visual Basic 3). Looking at it again was funny because I had absolutely no recollection of how it was designed, so it was like looking at someone else's program. In the late 1990's, I wrote an updated version of the software (using the same basic database of words) that listed categories, synonyms, and more.
The programs have some interesting features, such as showing POSs, style, synonyms for both the English and Spanish words, other translations of the word in each language, words/phrases with the word's root, and full conjugations of verbs. But the most interesting feature of all at this time is the ability to export all the data to a text file, which will allow me to import it into AI-C at some point (but not right now). It was also of interest to be able to see where I was with this kind of project over 20 years ago.
The fact that Ventanas' database has a pretty long list of English words in it makes me think that this may have been my original database of words for AI-C, contrary to what I said above, though I have no memory of it. Ah, well. Not having a memory is what keeps things fresh! (In case it's not obvious, I'm kinda old.)
If interested, you can download Ventanas by right-clicking on the link and saving to disk. This is the first version. I never finished and circulated the second version, though the program runs. I'll upload it and its later database when I get the chance. I'll also upload all the source code. If you Google, you may still be able to find places where you can download the actual VB3 programming language, though I don't know if you can compile with it under Win7+, nor why you would want to do so.
Where to put names and other languages:
Right now (March 2013) proper names and vocabularies of other languages are in separate tables. It just seems cleaner to have them this way rather than mixing everything into the Words table. However, I've been experimenting with the Lookup code to see how it works with an unlimited number of separate tables and it is basically a mess since each table has to be searched individually and it is easier to get duplicate entries for the same text (such as a person's last name and a company name).
So it appears that there is little choice but to put all text into the Words table.
Errors, ambiguities, and vagueness.
Even large, big-name dictionaries have errors, inaccuracies, ambiguities (see the Random House definition at the end of this document), and inconsistencies in them. In the course of this project, I have found hundreds and hundreds of basic errors (i.e.: typos, circular references, etc.) in such sources. This is understandable as these are very large works assembled by humans, and we all make mistakes. Unlike a computer database, dictionary publishers have no automatic way to enforce consistency or to verify accuracy.
The Cortex database may also have errors in it, having been created by humans (giving myself the benefit of the doubt), but with the difference that unlike printed dictionaries, the database can easily be corrected and over time, errors winnowed out. I have also written numerous routines which the computer can run to look for and correct some types of errors.
But even if errors get into the Cortex, it doesn't mean that they will be used. Before the Cortex can be used for NLP, software will have to parse documents, wikis, etc., and translate such text into linked concepts in the Cortex. Once the Cortex hits a critical mass, new text will be understandable to AI-C by examining links in the Cortex. If a word or link is incorrect, it will probably never become interlinked with the rest of the Cortex, so it will never be used; or if it is used and recognized by a human as wrong, it can easily be corrected.
Finally, as mentioned elsewhere, a significant percentage of what people communicate is incorrect -- wrong words, misspelled words, improper grammar, redundant words, or omitted words. On top of that you can add faulty logic and incorrect "facts", whether by accident or on purpose. It is not enough that AI-C can understand proper English (or another language); like humans, it must understand what humans are trying to say.
Entries for major Prefixes and Suffixes are included in the Words list and are given a prefix or suffix POS link entry in the Cortex. I debated doing this, but decided it might prove useful and couldn't really hurt, particularly for prefixes like un- and non-, which can be used with hundreds (if not thousands) of words to make new words. Actually, any kind of bits of text (such as other parts of words) can be stored in the Words table for use by the Cortex.
In addition to the above, the Lookup program, as part of its spell checking, uses a subroutine (FindAffixes) which looks for common suffixes and prefixes on "words" entered which are not in the Words table. It then suggests, based on the affixes found, what might have been the intended meaning of the word entered. For example, if "wiseful" is entered, the suggested correction is "very wise". (The comic strip Get Fuzzy is a mother lode of such words.)
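A minimal sketch of the FindAffixes idea; the affix table and glosses below are illustrative, not AI-C's actual data:

SUFFIX_GLOSS = {"ful": "very {root}", "less": "without {root}"}

def suggest_meaning(unknown: str, known_words: set) -> str:
    # Strip common suffixes and, if a known root remains, suggest a meaning.
    for suffix, gloss in SUFFIX_GLOSS.items():
        root = unknown.removesuffix(suffix)
        if root != unknown and root in known_words:
            return gloss.format(root=root)
    return ""

print(suggest_meaning("wiseful", {"wise", "hope"}))   # -> "very wise"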
Verb forms, such as past tense, present participles/gerund, and 3rd-person singular, have been included in the Words table, even though the book Speech and Language Processing, considered by many to be the Bible of NLP, says: the idea of listing every noun and verb [form] can be quite inefficient.
While it is true that space could have been saved by using rules for regular word forms instead of entering all the forms, the Words table is very small relative to what the Cortex will ultimately become.
Having all noun, adjective, and verb forms in the Words table should simplify (and thus speed up) parsing sentences and finding words, which is far more important than saving a little disk space.
Here's an example: What is the present tense word for the past tense word: indebted?
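A sketch of why stored forms win here, using illustrative stand-in data: since each form's Cortex entry carries the CortexID# of its root in the StartID field, the answer is a single indexed lookup rather than a spelling rule run in reverse:

# Stand-in for the form-to-root links described above.
root_of = {"played": "play", "plays": "play", "indebted": "indebt"}

print(root_of["indebted"])   # -> "indebt", a real but rarely seen verb; with
                             # every form stored, nothing has to guess whether
                             # stripping "-ed" leaves a legitimate word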