Contents:
Designing A NLP System AI/NLP Textbooks AI-C's Design AI-C Project Phases Downloads Contact me regarding AI-C Brains vs. AI
"Programs" in the brain. Should AI work like our brains? Advantages of the computer over the brain. Perfect memory of virtually infinite duration Unlimited memory capacity. Capacity for work. More efficient "programs" The Cortex table
Words table LinkTypes Table Pronunciation Table Numbers table Shapes Table Sources Table "z" Tables
Set-up Word look-up's Phrases The Links List Box The Links Table Window Creating new Cortex entries Entering Definitions Changing an entry Deleting an entry Numeric and date entries External Reference menu Tools menu Miscellaneous Information Time Reviews of NLP-Related Books AI/NLP Related Web Links and Books
Purpose of AI-C
In addition to being the documentation for the AI-C Lookup program, this document explains the background and design of the AI-C database. It also helps keep me on track, which is its main purpose; I refer back here often to refresh my memory on how and why things are done in the AI-C Lookup program. It also contains an analysis of the program's Visual Basic 6 code.
Natural Language Processing (NLP) is the foundation of general Artificial Intelligence (AI). For purposes of this document and project, NLP-AI (or usually just AI) refers to the ability of a computer program to at least match the human brain's ability to reason and communicate. This is sometimes referred to as Natural Language Understanding rather than just Natural Language Processing, but when I use NLP in this document, I mean it to refer to understanding text, not just processing it.
The ultimate goal of a general AI would include all the senses (sight, hearing, even touch, smell and taste), but that goes beyond the current scope of this project.
Due to the success of IBM's Watson program in beating the top players in TV's Jeopardy, some people think that NLP is off and running, but the problem is that what Watson was doing had little in common with natural (conversational) language. It was not maintaining a dialogue; it was translating input into key words which would allow it to look up data in its massively extensive database of mostly trivia.
AI-C is a long, long way from completion, but some of the already-existing ideas and data may be of use to others who are working on their own programs, even for those who may not agree with my overall approach. The current version of the program also contains many tools for working with words, as described herein.
The AI-C Lookup program is just a tool for working on the AI-C database. 
The program is NOT the point nor purpose of AI-C, although the routines in the program should eventually be helpful in writing code for using AI-C as an NLP-AI program. No copyrights are attached to anything by me on the AeyeC web site. Any of this can be adapted in any way you wish. If anyone wants to adopt the AI-C database and approach, that is great too. Either way, I am more than happy to cooperate with others on AI/NLP. There is NOT just one way of doing anything in AI-C. I have found myself constantly wanting to add this disclaimer to this documentation and to the program code, but I have instead displayed it prominently here and hope that serves for the whole docs/code. The structure of the databases lends itself to an almost endless variety of approaches. This document and the source code show how I am doing things, so naturally these are the methods I think are best, but because the databases are so flexibly designed, there are an almost endless number of different approaches and techniques which could be used. Feel free.
Designing A NLP System
Who's to say what the best approach to NLP is? If you want to debate what the best word processor or best database program or best web browser is, you can run, test, and compare competing programs to see which ones accomplish their goals the best, but when it comes to full, general AI, there are not any programs which have achieved that goal, so how can anyone say for sure which approach, if any, will ultimately achieve the best results?
Designing an AI/NLP system is relatively easy. The biggest problem facing AI/NLP developers is capturing real-world data. Even if we do not worry about capturing specialized data, such as every detail of medical science, or even every detail of human anatomy, just capturing the majority of common real world data known by, say, the average 20-year-old, is an enormous task.
Most, maybe nearly all, NLP projects to date have tried to acquire real-world knowledge (often called "common sense data") by getting people to enter facts either as sentences or into forms. (A more recent project by the name of NELL, and probably some others unknown to me, acquire data by reading pages on the Internet. NELL appears to be focused on gathering facts about known entities, such as people and places, so their efforts are only marginally applicable to working on full AI/NLP.)
If the approach of using random web surfers to build a knowledge base were applied to building a tall office building, it would start with the architects coming up with a plan for the building, then sending into the building site passersby to work on whatever suits their fancy. Most passersby are not expert builders, plus most of them already have jobs and cannot dedicate a lot of time to the new building. Most will do a little sloppy work and then move on. It is unlikely that more than a few, if any, will have the time, expertise, and willingness to do a proper job over the long period of time required to complete the building. 
Worse yet, since the people can work on whatever they want, some may try to put up walls and ceilings while others are painting, laying carpet, etc., all the while, no proper foundation has been laid for them to build upon. And many people will make errors which will weaken the overall structure. Here is a more sensible way to use volunteers to start from an extensive list of words (assuming English, in this example). Get volunteers to sign up, including email address and screen name. For each word show:
Show guidelines:
Most NLP projects do attempt to define words by linking them to classes and to other words. The problem is that I have never seen a project which assigns an ID# to every entry and links the ID numbers rather than linking the actual words. The problem with having long links of words is reusing parts of the links with other words. If entry ID#s are used, then a single number can be used to refer to a long list of interlinked words, but without ID#s, the entire list may have to be used each time you want to refer to it.
AI/NLP Textbooks
AI/NLP textbooks are attempting to teach how to do something which has never been done. How can any textbook say with certainty "this is the best way to do such-and-such" when there is no possible proof of the claim? The book Natural Language Understanding often discusses some approach to NLP at length and concludes by saying something like: given the current state of research in this area, we can't say if this will work.
Another problem with NLP textbooks is that they are usually not based on original research; that is, the author is not someone who has an active NLP project which is well under way. Textbooks are expected to have a lot of references to other people's research, which results in textbooks which espouse the same theories that everyone else is putting forward. (In contrast, the book The Psychology of Reading is full of references to research done by the authors.)
For example, as pointed out above, most NLP projects seem to use the same basic concepts in their knowledge base design, so NLP textbooks use that standard design as the basis for analyzing and discussing NLP approaches. But if the standard design is flawed, then that means all the NLP analysis and discussion which is based on that design is equally flawed, and I believe that to be the case.
An often stated assertion in English AI/NLP texts is that it is inefficient to store whole word forms for each part of speech ("POS"), such as (verbs:) walk, walks, walked, walking, (nouns:) car, cars, (adjectives:) hot, hotter, hottest, etc. Instead, they insist that it is more efficient to store the root of a word and add the suffixes as needed.
The flaw in this argument is that it values disk space (needed to store the various forms) more highly than it values the time it takes to analyze words to see if they may be suffixed forms of some other words. In reality, disk space is cheap and processing time is very "expensive" when analyzing deep into branches of possible meanings. 
Time not spent figuring out if a word is a suffixed form of some other word is time that can be better spent trying to best determine the meanings of the words.
Another problem with not storing whole forms of suffixed words is that the Words table also links to the Syllables and Pronunciation tables, which it cannot do for suffixed forms if the suffixed words are not in the Words table. The Words table also stores for each word its Soundex code, its letters in alphabetical order (though this is presently only being used to unscramble scrambled words and might be removed if no other purpose is found), and its letters in reverse order, which is used for finding specified word endings (including suffixes) without having to go through the entire Words table to look for matching endings.
Obviously, you cannot find pronunciations, syllabification, or any of the other special formats for word forms which are not in the Words table. It's also hard to understand how you would link, say, plural words in the Cortex when the plural words aren't in there, such as trying to express "pack of dogs".
The proponents of not storing whole forms may have a way around these problems, but it is highly unlikely that the way is anywhere near as efficient as having the whole forms to work with. The bottom line is that the cost in disk space of storing all word forms is nothing compared to the loss of efficiency from NOT having all word forms stored.
The fact is that I have never seen a book on NLP which analyzes how knowledge bases should best be designed, yet knowledge bases are the very foundation of NLP! It is pointless to talk about how something in NLP should be done when you have not established the design of the most important tool to be used in doing it.
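The two helper spellings described above (letters in alphabetical order and letters in reverse order) can be sketched in a few lines. This is only an illustration, not the AI-C Lookup code (which is VB6 working against an Access table); the word list, the function names, and the in-memory sorted list standing in for a database index are all assumptions made for the example.

```python
import bisect
from collections import defaultdict

# Hypothetical stand-in for a few rows of the Words table.
words = ["walk", "walks", "walked", "walking", "car", "cars",
         "hot", "hotter", "hottest", "listen", "silent"]

def sorted_letters(word):
    """Letters in alphabetical order; anagrams share the same key."""
    return "".join(sorted(word.lower()))

def reversed_letters(word):
    """Letters in reverse order, so a word ending becomes a prefix."""
    return word.lower()[::-1]

# Index on the alphabetized spelling, used for unscrambling.
anagram_index = defaultdict(list)
for w in words:
    anagram_index[sorted_letters(w)].append(w)

# Sorted list of (reversed spelling, word), playing the role of an indexed field.
reversed_index = sorted((reversed_letters(w), w) for w in words)
reversed_keys = [key for key, _ in reversed_index]

def unscramble(scrambled):
    """All stored words made of exactly the given letters."""
    return anagram_index.get(sorted_letters(scrambled), [])

def words_ending_with(ending):
    """Jump to the first matching reversed spelling, then read forward
    until the prefix stops matching, instead of scanning every word."""
    key = reversed_letters(ending)
    position = bisect.bisect_left(reversed_keys, key)
    matches = []
    for rev, word in reversed_index[position:]:
        if not rev.startswith(key):
            break
        matches.append(word)
    return sorted(matches)
```

With this sample list, `words_ending_with("ing")` finds `walking` after a single binary search, and `unscramble("enlist")` returns both `listen` and `silent`.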
Books and classes on NLP tend to focus on tools for analyzing text in order to extract information from the text,
AI-C's Design
The design idea behind AI-C is to have a Cortex (table) which has no text in it, just pointers linking entries to each other along with numeric codes indicating the type of links. Text in AI-C is stored in other tables and linked into the Cortex via pointers. Entries which point to text and numbers in other tables will make up only a very small percentage of the Cortex; the vast majority of entries will be linking entries within the Cortex. Example:
Note that entry #5 does not point directly to any words. It points to other entries which (eventually) point to words. This allows you to reuse the entry for big - red with other words or linked sets of words, such as:
#7: WordID#5 (engine - noun)
#8: 6 (fire) - 7 (engine)
#9: 4 (big red) - 8 (fire engine)
#10: WordID#6 (fireman - noun)
#11: WordID#7 (rides - verb)
#12: 10 (fireman) - 11 (rides)
#13: 12 (fireman rides) - 9 (big red fire engine)
As you can see, the last entry combines 6 words with only two pointers. In addition, this approach allows you to search for any combination of these words, such as finding all things which are big-and-red:
It should be noted that the way we look up entries is that the fields in the Words table and the Cortex table are indexed so that retrieval of specific entries is virtually instantaneous. Compare this to a knowledge base made up of "facts" in sentences, such as:
For an example of a variety of linked entries, look up aardvark in the AI-C Lookup program. One entry, #126141: aardvark's excellent (sense of) hearing and good (sense of) smell locate prey links 20 different Cortex entries.
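The pointer-linking idea in this section can be illustrated with a small sketch. It is not AI-C's actual data or code: the entry numbers below are made up for the example and do not match the numbering used in the text, the `Entry` class and function names are hypothetical, words are stored inline where AI-C would store a WordID, and link types are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Entry:
    word: Optional[str] = None      # stands in for a pointer into the Words table
    start_id: Optional[int] = None  # first linked Cortex entry
    next_id: Optional[int] = None   # second linked Cortex entry

# Illustrative entries only; the numbering is an assumption for this sketch.
cortex: Dict[int, Entry] = {
    1: Entry(word="big"),
    2: Entry(word="red"),
    3: Entry(start_id=1, next_id=2),    # big - red
    4: Entry(word="fire"),
    5: Entry(word="engine"),
    6: Entry(start_id=4, next_id=5),    # fire - engine
    7: Entry(start_id=3, next_id=6),    # big red - fire engine
    8: Entry(word="fireman"),
    9: Entry(word="rides"),
    10: Entry(start_id=8, next_id=9),   # fireman - rides
    11: Entry(start_id=10, next_id=7),  # fireman rides - big red fire engine
}

def expand(entry_id: int) -> List[str]:
    """Follow the pointers of an entry down to the words they lead to."""
    entry = cortex[entry_id]
    if entry.word is not None:
        return [entry.word]
    return expand(entry.start_id) + expand(entry.next_id)

def reaches(entry_id: int, target_id: int) -> bool:
    """True if entry_id is, or links down to, target_id."""
    if entry_id == target_id:
        return True
    entry = cortex[entry_id]
    if entry.word is not None:
        return False
    return reaches(entry.start_id, target_id) or reaches(entry.next_id, target_id)

def entries_containing(target_id: int) -> List[int]:
    """IDs of every other entry that is built on top of target_id."""
    return [i for i in sorted(cortex) if i != target_id and reaches(i, target_id)]
```

Here `expand(11)` recovers all six words from an entry holding just two pointers, and `entries_containing(3)` lists the entries built on the reusable "big red" link. A real database would answer the second question with indexed look-ups on the pointer fields rather than a walk over every entry.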
AI-C Project Phases
I have been working on this project in one form or another since the late 1980's, starting over again and again. One time I actually completed the addition of my list of 120,000+ words to a database, but I decided that the approach (using syllables instead of words as the basic unit of text) was not right, so I started over from scratch using words as the basic unit of text storage. That was the point where I started the first phase of the current design.
Obviously, I'm not pressured by any kind of timetable nor profit motive. I just do this for the challenge and the fun of it.
The first phase was creating the vocabulary database described below where words are linked to parts of speech, to pronunciations and to their syllabifications. On Oct.1, 2009, this phase was completed to the extent of the word lists available to me at the time. Of course, the addition of new words will be an unending process.
As of Oct.1, 2009, the Words table had over 136,000 entries in it, although a significant percentage of these are forms of root words; that is, the table includes multiple forms of verbs, adjectives, etc. For example, while the verb "to play" may have one basic meaning, its forms in the Words table include "play, played, playing, plays". And of course you also have "play, plays" - noun and noun plural - and "play" as an adjective (e.g.: "play money").
The second phase started with parsing definitions from dictionaries and linking the results in the Cortex. When all the definitions are done, the plan at this time for the next phase is to parse entries from Wikipedia and add that information to the database via linking. While a dictionary provides a very brief definition of a word, an encyclopedia gets into much greater detail, which is what we want in AI-C. 
The third phase, having created a solid foundation of language data, would be to have AI-C read text from various sources (mainly the Internet) and integrate the data into its Cortex (and related tables). The next phases would be to put AI-C to whatever use is possible after having completed the other phases. I am 64 and would be pleasantly surprised to live long enough to see this phase (or even the third phase) in gear.
Downloads
The files linked below are not there (other than the VB6 runtimes) because I work on this project every day (at times) and it is too hard to keep everything cleaned up, archived, and uploaded on a regular basis. Contact me (see next section) and I will upload the latest files. When asking for the files, it would be appreciated if you would provide some information about your interest, what you plan on doing with the files, and how you learned about this web site.
AI-C Database in Access 2007 format.
AI-C Lookup program VB6 source code and executable.
If you do not have VB6, you may need to install the
VB6 runtime modules to get the
This file you are reading serves as the documentation. It is included with the files above.
FordSoft File Viewer
The reason for listing it here is that it contains the AI-C spelling corrector
The zip file includes the VB6 source code and executable, though to run it,
Contact me regarding AI-C
My name is Nelson Ford. You can contact me via the email form linked below.
Brains vs. AI
The Brain's Cortex
In the brain, the Cortex (or neo-Cortex) is a 6-layer sheathing that encases the rest of the brain and is responsible for higher level brain functions. It is essentially an enormous collection of memory locations (sort of) with a massive amount of parallel interconnection between them. From this comes everything that we are and that we know and think -- our non-artificial intelligence.
Brainiacs seem to agree that the Cortex is in the business of searching for patterns in its memory network with which to respond to input, and creating new patterns when appropriate ones are not found. When patterns are found, they are used to predict what is coming. This is what allows us humans to react to things in a timely manner. However, nobody knows exactly how all of this works, which is what makes trying to create Artificial Intelligence which mimics it so difficult.
"Programs" in the brain.
Even if we could create a huge database which resembles the brain's Cortex, it would not do anything unless we wrote a program to drive it. The program driving our Cortex is in our DNA. (This is not to be confused with "programs", such as for playing chess, which must somehow be stored in our brains as a very complex set of interconnected memory locations.)
This DNA Brain program is not trivial. Scientists believe that the compressed version of the code (removing redundancy) is 30MB-100MB. This seems extremely large, given that the code only governs the framework for how the brain stores and processes data; that is, it does not contain programs for performing specific real-world tasks (such as playing chess).
When I started this project, I certainly didn't anticipate having 30MB-100MB of compressed code with which to manage the Cortex data. I was under the impression that essentially all of the work of the Cortex was done in the Cortex itself. 
Like the brain, AI-C is made up of two parts -- the database which contains all the knowledge and the programs which work with the knowledge, but we have the advantage of being able to write programs for specific, complex tasks in stand-alone apps rather than by wiring together data points the way the brain does. Again I refer to the example of a chess program.
Should AI work like our brains?
No. The brain, for all its wondrousness, has also been the generator of unspeakable horrors. The brain is the seat of insanity, degeneracy, irresponsibility, self-destructiveness, and on and on. It would be folly to manufacture an AI which works exactly as does the brain and not expect the same types of problems. And yet, a lot of attention is currently (Dec.2009) being given to "reverse engineering" the brain for AI, including IBM's supposed emulation of a cat's brain.
What really matters is to meet and exceed the output of the brain without the deficiencies of the brain. An analogy would be the manufacture of synthetic lumber for building decks, among other things. The goal is not to manufacture boards which are identical in all respects to lumber from trees, but to create artificial boards which serve the same purpose in deck building as lumber from trees, but do not rot, splinter, warp, need to be resealed regularly, etc., the way "natural" boards do.
The computer has numerous and very large advantages, so if it can be made to do the same thing as the brain, but with the advantages of the computer, it will be vastly better than the brain. The brain receives data input (externally or internally), looks for patterns which relate to the input, analyzes those patterns to try to predict what's coming next, stores and retrieves data, and generates output. These are all tasks at which the computer can excel.
Advantages of the computer over the brain.
The brain absorbs spoken input at about the same speed as it absorbs printed input, whereas a computer can absorb printed input (by "printed", I mean electronic text in ASCII format) many, many times faster than spoken input. A single communication within a computer is millions of times faster than within the brain. And although the brain has the advantage of its communication being massively parallel, we may not be far from achieving that in computers.
When it comes to storing data, the basic unit of text storage which AI-C must handle is the word, or in special cases, a word fragment. Of course, the computer software below the database level does keep track of individual characters and somewhere in the computer hardware, code is even required to draw a character on the screen, but that is not something that the AI normally must worry about. The brain's cortex does not have the ability to store a word as a single unit. It is not even able to store individual letters. Neural coding is not completely understood, much less how the cortex stores information which it assembles into letters and words. The fact that the cortex stores small elements of letters is what allows the brain to recognize a letter in any font in which the elements are recognizable. Web sites use this ability as a security measure, distorting text so that computers cannot make out the letters, but the brain can, so a human user can enter the text to pass the security test. So it is possible that the brain's basic unit of word storage (for the printed form of a word) is the linked-together general features of each letter making up the word (though not necessarily the correct set of letters, when dealing with a word which was only heard and not seen spelled out). On the other hand, because the brain can store data using parallel connections, it does not have to recall letters (or letter fragments) one unit at a time. Instead, it probably has words stored like this:
[Diagram: a single word node with parallel links to the letters (i) (m) (p) (r) (o) (v) (i) (s) (e)]
In addition to the printed form of a word, the cortex links together neurons in a way that stores the auditory form of words. In fact, as children, this is obviously what we learn and the brain stores before we learn to read.
While parallel linking to each letter in a word may seem to be the equivalent of a single link to the whole word (as we do in AI-C), the problem with the former is that the brain's links for any of the individual letters may weaken and some or all of what makes the word may be forgotten. That brings up the next point:
Perfect memory of virtually infinite duration
The brain's ability to remember data is dependent upon the number of times its related links are reinforced by repeating the data (and possibly by brain functions which take place while we sleep), and those links will still weaken over time if not reinforced in the future (although some evidence suggests that lifelong memories may be linked differently). In extreme contrast, if a link is created a single time to a piece of data in the computer, AI-C will remember it forever.
This leads to a significant difference between the best way to store data in the brain and in the computer. The brain must get around the fact that it will gradually lose data whose links are not sufficiently reinforced and/or when it must make room for new data. For example, most scientists believe that rather than store every possible verb form, the cortex uses "rules" to create regular verb forms and stores only the irregular forms. Likewise, rather than storing every "pixel" in a visual image, it stores only key elements of images. One reason it does these things is that the more it must remember, the more it will eventually forget. Another issue is that the brain has a limited amount of data storage space, as discussed below. 
In contrast, the computer never forgets anything, so it can store ALL verb forms and only has to use rules to create the forms of a new verb once, before storing them. The advantage is that it is much faster to recall stored verb forms than to recalculate them each time, particularly since the words are indexed for immediate retrieval the first time they are stored. Recalling stored data is also more reliable than recomputing verb forms, reassembling whole pictures from parts, etc.
Computer memory can be backed up, both locally and in distant locations away from local disasters, and restored if the computer's memory is damaged and/or must be replaced. In contrast, data in the brain can be lost forever due to brain damage either from outside trauma or internal disease, strokes, tumors, aneurysms, etc.
Unlimited memory capacity.
The ability of the brain to forget data (due to weakening links) can be viewed as a necessity for a brain with limited (albeit huge) room to store and recall data -- it cannot keep adding data to the cortex indefinitely. (See http://www.physorg.com/news185720165.html.) The computer has no such limitation, so it has no great need to let data fade in order to make room for new data. Still, it is easy enough to mimic the strengthening and weakening of links in AI-C, if need be, by linking to usage frequency counters within AI-C.
Capacity for work.
The brain must be put into a sleep state for about 6-8 hours a day. Less sleep can cause reduced brain performance. Total sleep deprivation can cause even more extreme mental problems and even death. We don't really know why this is or exactly what the brain is doing during this down time. Current theory is that functions take place which strengthen new data links formed during the day. Even with the 8 hours sleep, the brain cannot work on problems non-stop the other 16 hours a day, yet a computer's AI can run 24/7, non-stop, on a single problem if necessary. 
With multiple processors, it can work on many problems, as well as routine chores, at the same time, and do so continually at its highest level of performance, never getting tired, sick, or distracted.
More efficient "programs"
As previously mentioned, the greatest advantage of the computer may be that we (and in the future, AI-C) have the ability to write programs to accomplish tasks which our brains can only mimic by wiring together data points (similar to circa 1950's machine coding before the first programming language was invented).
While the overall functioning of the brain is controlled by our DNA, a "program" in our brains to, say, play chess, is not a program but an extremely complex set of interconnected neurons. It boggles my mind to even think of such an approach. In contrast, a program for a computer to play chess can be written for that specific purpose.
At one time, a disadvantage of computer game programs was that they did not allow the computer to learn by experience and adapt, but we are seeing more adaptive game programs being written these days, though even without that ability, IBM was able to develop a program which could beat a top international chess grandmaster.
Before I started working on AI-C, I was working on a bridge-playing program which stored all of the data for its decision making in data files rather than being hard-coded into a program. A major advantage of this approach is that the program can modify its own decision-making parameters. I finished the bidding part before deciding that AI-C was a more beneficial use of my time.
In the 1980's, I wrote CardShark Hearts and CardShark Spades, but they used hard-coded algorithms. I had intended to make their algorithms self-modifiable, but they began to win at a high enough rate without doing so. And after all, nobody really wants to play against a program they can never beat. 
Some people may argue that humans can also run specialized programs, use calculators, etc., but that is not at all comparable. For the computer, specialized programs are, for all practical purposes, part of the computer's "brain" in that they are instantly accessible and can be directly integrated into the AI's "thought processes".
AI-C's Database Structure
Microsoft Access
The current general design of AI-C's database was created with Microsoft Access97 and later updated to Access 2007 and is no longer readable by Access97. (Even earlier versions were written with Btrieve.) Any further references to Access refer to the 2007 version.
AI-C can be viewed and edited with Access, although using computer software written specifically for viewing and editing the AI-C data, such as the AI-C Lookup program documented below, is a much more efficient method.
I am presently converting the database to SQLite for use on portable Android devices. If needed, the data could be provided in comma-delimited text files which could easily be imported into any database, including older versions of Access. Either way, others should be able to use their choice of programming languages with the data.
The maximum size of an Access 2007 database is 2GB. As this is being written, the Cortex has about 120,000 entries and the total size of the database is about 80MB. Other tables make up some of that 80MB, but even so, it seems likely that the database will hold only about 2,500,000 entries before it hits 2GB.
Dec.2010 note: the Cortex has only gone up by about 8000 entries, but the database has increased in size to over 1.1GB. When I noticed the database getting big, I split the Words table, the Syllables table, the Pronunciation table, and a bunch of other tables off to a new database called AIC Words, which is 88.6MB, leaving just the Cortex, LinkTypes, Numbers, and Shapes tables in AIC Cortex, which is now only 14MB. 
The Words tables (those with text in them) should not get a whole lot larger (although they could if other languages are added to them). Most important at this point is that 128,000 Cortex entries is only taking 14MB, meaning that about 18.3 million Cortex entries could fit in one 2GB database.
Another limit to be faced some day is that the fields in the Cortex are all long integers, which can only be numbers up to just over 2 billion, so even if the 2GB database limit problem is solved somehow, at some point the fields would then need to be changed to double precision and the database would then double in size, allowing only half as many entries in the same size data file.
IBM's cat brain emulator took over 147 thousand CPUs and 144 TB of main memory, which for the human brain could translate to over 3000 TB. So 2GB looks pretty skimpy for the long run, but it gives me a lot of room to work for now and I'll worry about what to do next when I run out of that space. It is highly unlikely that I could ever make 2.5 million entries manually, so the 2GB limit is safe at least until I can automate the importing of text into linked entries in the database.
IBM's Watson program (for playing "Jeopardy") has at its beck and call ninety IBM POWER 750 servers, 16 Terabytes of memory, and 4 Terabytes of clustered storage. This is enclosed in ten racks including the servers, networking, shared disk system, and cluster controllers. These ninety POWER 750 servers each have four POWER7 processors, each with eight cores. IBM Watson has a total of 2880 POWER7 cores. IBM’s POWER 750’s scalable design is capable of filling execution pipelines with instructions and data, keeping all the POWER7 processor cores busy. At 3.55 GHz, the on-chip bandwidth of each of Watson’s POWER7 processors is 500 Gigabytes per second. The total on-chip bandwidth for Watson’s 360 POWER7 processors is an astounding 180,000 Gigabytes per second. Not your daddy's IBM-PC, eh? 
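The capacity figures quoted above for the split-off Cortex database can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes that 2GB is taken as 2,000MB (which is what yields the 18.3 million figure), that file size grows in proportion to the number of entries, and that a long integer is a signed 32-bit value. The function and constant names are made up for the example.

```python
# Assumption: "2GB" is treated as 2,000MB, and size scales linearly with entries.
ACCESS_LIMIT_MB = 2_000

# A signed 32-bit long integer tops out just over 2 billion.
MAX_LONG = 2**31 - 1

def entries_at_limit(current_entries, current_size_mb, limit_mb=ACCESS_LIMIT_MB):
    """Project how many entries fit in the size limit at the current density."""
    return current_entries * limit_mb / current_size_mb

def bytes_per_entry(current_entries, current_size_mb):
    """Average storage per entry, counting 1MB as 1,000,000 bytes."""
    return current_size_mb * 1_000_000 / current_entries

projected = entries_at_limit(128_000, 14)   # roughly 18.3 million entries
density = bytes_per_entry(128_000, 14)      # roughly 109 bytes per entry
```

On these assumptions the file-size limit (about 18.3 million entries) would be reached long before the pointer fields run out of range at `MAX_LONG`.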
Finally, even if I am unable to crack the 2GB barrier, that should be enough room for me to complete enough of a Cortex database to be able to determine if the approach I'm using will work as an NLP-AI. If it does work and nobody has come up with anything better by then, perhaps someone else will be able to take it beyond 2GB.
The Cortex table
The Cortex table's fields
The Cortex table's data structure was designed to be as simple (and thus as flexible) as possible. A record consists of the following numeric fields:
The five fields in the Cortex are all numeric (long integers, at this time). Text is stored in other tables, such as Words, Syllables, and Pronunciation, which are discussed in more detail further down. Numbers and dates can be stored in the Numbers table.
The StartID# and NextID# are usually pointers to a pair of entries which are related in some way, and each (or neither) of which may, in turn, point to other related entries. The LinkType normally indicates the type of relationship between the two entries, and in some cases, the data table in which the source of the link resides or the type of data being entered, etc. However, LinkTypes are meaningless unless software has been written to recognize the LinkTypes and take appropriate action, hence all the usually's and normally's above. This will be discussed in more detail later.
In March 2010, I added a Date Entered field. I didn't list it above because it is an automatic field and not essential to the Cortex. It may be handy to be able to see when entries were made, but because database space may become an issue at some point, it is a field that could be dropped in the future or moved into its own database.
In August 2010, I added a Freq field. Again, I have not shown it above because it is not an essential field. It is designed to aid in selecting the most likely correction when a word is entered which is not in the database. Because the usage frequency is based on the part of speech for each word, it will also help, when parsing input, in deciding which part of speech a word is most likely being used as. See Spell Corrector under Using the AI-C Lookup Program: Look-Up buttons for more information.
Each word in the Words table has a unique ID#. For a word to be usable in the Cortex data table, its WordID# must be entered along with its Part Of Speech (POS) in the LinkType field. Many, if not most, words have multiple parts of speech. An entry must be made in the Cortex for each WordID-POS combination, such as mail - noun and mail - verb. When entering a different form of a verb, adjective, adverb, or noun plural, the CortexID# of the root word is entered in the StartID field. (It may be useful to link a different POS as the root of a word, such as prince, noun, being the root of princely, adjective, and princely, adverb. I didn't think of this until Phase 1 was completed, so only a few such entries made after Phase 1 have links to roots with different POS's.)
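As a rough illustration of the record layout and the WordID-POS entries described above, here is a minimal Python sketch. It is not taken from the AI-C Lookup program (which is written in Visual Basic 6). The field names (cortex_id, word_id, start_id, next_id, link_type), the use of 0 for "no link", the POS codes, and all ID values are assumptions made up for the example.

```python
from dataclasses import dataclass

# Assumption: 0 marks an unused pointer field; AI-C may represent this differently.
NO_LINK = 0

# Hypothetical LinkType codes standing in for parts of speech.
POS_NOUN, POS_VERB, POS_VERB_PAST = 1, 2, 3


@dataclass
class CortexEntry:
    """One Cortex record: five numeric fields (names are illustrative)."""
    cortex_id: int
    word_id: int
    start_id: int
    next_id: int
    link_type: int


# Hypothetical slice of the Words table: WordID -> text.
words = {101: "mail", 102: "mailed"}

# The Cortex, keyed by CortexID.
cortex = {}


def add_entry(word_id, link_type, start_id=NO_LINK, next_id=NO_LINK):
    """Create a Cortex entry and return its new CortexID."""
    cortex_id = len(cortex) + 1
    cortex[cortex_id] = CortexEntry(cortex_id, word_id, start_id, next_id, link_type)
    return cortex_id


# One entry per WordID-POS combination.
mail_noun = add_entry(101, POS_NOUN)
mail_verb = add_entry(101, POS_VERB)

# A different form of the verb stores the root's CortexID in its StartID field.
mailed = add_entry(102, POS_VERB_PAST, start_id=mail_verb)


def root_text(cortex_id):
    """Return the text of an entry's root word (or its own text if it has no root)."""
    entry = cortex[cortex_id]
    root = cortex[entry.start_id] if entry.start_id != NO_LINK else entry
    return words[root.word_id]
```

Here mail - noun and mail - verb share a WordID but are separate Cortex entries, and root_text(mailed) follows the StartID pointer back to the verb entry for mail.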
When adding a WordID to the Cortex, syllabification and pronunciation entries are made in their respective tables. The Cortex does not use a word without a POS. A word cannot be defined or used without knowing its POS, so there is no point in adding it to the Cortex without the POS, though I use POS loosely because it could be an abbreviation, a prefix, etc. (See the LinkTypes table.)
The Words table contains my initial, cleaned-up list of words from different sources, but not all words in the Words table were added to the Cortex. In some cases, it is because a word was a misspelling or was not found in dictionaries. Most noun plurals which are formed just by adding an s to the root were not entered into the Cortex. But there could also be some perfectly good words which just didn't get added for whatever reason.
A WordID-POS entry will, in most cases, be linked as an element of (or type of) a broad category (set), which helps further differentiate it from other uses of the same word because a word inherits all the characteristics of the set to which it belongs. For example, abacus - noun can be a type of slab (one at the top of a column) or it can be a hand-held computing device. When linking other entries to abacus which further define it, you would not want to link them to the abacus - noun entry because they would be unlikely to apply to both meanings. Instead, you would create an entry linking the WordID-POS entry as an element of the set (or: a type of) slab - noun and another entry linking the same WordID-POS entry to the set device - noun, then link other entries to the appropriate one of those two entries.
(Note: I have found that when I refer to a word as being an element of another word/set, I lose track of what that means. Example: I first linked eye as an element of face. In general, that is what element means: a constituent or part of a whole, but in set theory, it means a part of a given thing, similar in nature to it. 
I later realized that an eye does not inherit all the characteristics of a face, so it could not be a true element of it, for our purposes. When you think type of, it makes the process easier. An eye is a type of organ - or more precisely, a sense organ - and is a part of the face. For the record, an organ is a type of tissue and face is a type of forward, external part of a system which has an internal structure. Side note: as you can see, the set can be more than a single word.) (I have created a little problem for right now. It is very helpful to have an entry in the Type Of list, but it is not always easy to come up with something relevant which a word is a Type Of, but saying that it is a Part Of something serves much the same purpose, so when I re-linked words from Type Of links, I made them Part Of links. The problem is that a LinkType of component of already existed -- see abacus -- and component of seems logically the same as part of. I need to give this some more thought, but I may have to change all the component of entries to part of.)
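The abacus example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LinkType codes (TYPE_OF, PART_OF), the direction of each link (word in Start, set in Next), the "bead" detail, and the simplified entry tuple are all invented for the example and are not AI-C's actual values.

```python
# Hypothetical LinkType codes.
TYPE_OF = 100
PART_OF = 101

# Simplified Cortex: cortex_id -> (label, start_id, next_id, link_type).
# Word-POS entries carry a label; link entries carry pointers. 0 = unused.
entries = {}


def add(label="", start_id=0, next_id=0, link_type=0):
    """Add an entry and return its new ID."""
    cortex_id = len(entries) + 1
    entries[cortex_id] = (label, start_id, next_id, link_type)
    return cortex_id


# WordID-POS entries.
abacus = add("abacus - noun")
slab = add("slab - noun")
device = add("device - noun")
bead = add("bead - noun")

# One "type of" entry per meaning of abacus.
abacus_as_slab = add(start_id=abacus, next_id=slab, link_type=TYPE_OF)
abacus_as_device = add(start_id=abacus, next_id=device, link_type=TYPE_OF)

# Further detail is linked to the relevant meaning, not to abacus - noun itself.
add(start_id=bead, next_id=abacus_as_device, link_type=PART_OF)


def meanings(word_entry):
    """IDs of the 'type of' entries that start from a WordID-POS entry."""
    return [cid for cid, (_, start, _, lt) in entries.items()
            if start == word_entry and lt == TYPE_OF]


def parts_of(meaning_entry):
    """Labels of entries linked as 'part of' a particular meaning."""
    return [entries[start][0] for (_, start, nxt, lt) in entries.values()
            if nxt == meaning_entry and lt == PART_OF]
```

With this layout, parts_of(abacus_as_device) finds the bead entry while parts_of(abacus_as_slab) finds nothing, which is the separation of meanings the text describes.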
Tables in AI-C have a Tag field which can be used by software to mark entries for later review, normally by a human. For example, the subroutine FindDiffPrns looks for different pronunciations of the same word and uses the Tag field to mark such entries. Another maintenance field in the Pronunciation table is Ver, which is used to indicate that a pronunciation has been verified. Computed pronunciations can be wrong because rules are not guaranteed to apply to all words, so knowing that a pronunciation has been verified is helpful in choosing between two pronunciations, computing new pronunciations, etc. You can delete those fields (or add others) if you wish, because...
There is NOT just one way of doing things in the Cortex. Normally, the Cortex does not use a word without a POS, but that's just the way I am doing it now. A programmer could write a routine using particular LinkTypes which link to text without POS's. Likewise, there may be many different ways to link related Cortex entries together. When linking two entries, many times it does not matter significantly which entry is in the Start and which is in the Next. And in the long run, AI-C itself will likely reorganize links for optimum efficiency anyway.
AI-C has what is probably the simplest possible database design (just five essential fields in the main table) and is available in the simplest possible formats (Access 2007 or straight ASCII), with which you (assuming you program) can do anything you want using any programming language you want. Even building off the existing database and software, it should easily be possible to add fields to tables, add tables to the database, and even add new databases to the project, then incorporate them into AI-C by adding new LinkTypes which let you write code for dealing with them.
Words table
Each record in the Words table consists of an entry ID#, a text field which can be up to 50 characters, and a Soundex field (and currently, a word frequency field). Since the Words table was designed to hold only single words (i.e., not phrases), 50 characters seemed like plenty, but the limit can easily be changed.
There are no duplicate entries in the Text field, although capitalization counts, so that Ford (brand of car) and ford (crossing a river) are NOT considered duplicates. However, I cannot set the index of the Text field in the Words table to "no duplicates", because with that setting, Access ignores case and would not allow the two entries. I tried following Access' instructions for making the database engine respect case, but their suggestions did not work for me.
Therefore, before adding a word to the Words table, it is necessary to check that it is not already there, since the index is set to allow duplicates even though we don't want them. Similarly, if you check whether a word is already in the Words table and the database engine says it is, check that the capitalization is the same. That is, if you search for "ford" and the engine finds "Ford", it will tell you that a match has been found. If the case does not match, continue searching to see whether it finds the word with matching case.
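The check described above can be sketched as follows. This is a minimal Python illustration, not the AI-C Lookup code: the Words table is modeled as a plain list of (word_id, text) rows, the case-insensitive comparison stands in for what the database engine reports as a match, and the function names are hypothetical.

```python
def find_exact(words, text):
    """Return the ID of the row whose text matches `text` including case, else None.

    `words` is a list of (word_id, text) rows standing in for the Words table.
    """
    for word_id, stored in words:
        # A case-insensitive engine would report this row as a match...
        if stored.lower() == text.lower():
            # ...so confirm the capitalization before accepting it;
            # otherwise keep searching the remaining rows.
            if stored == text:
                return word_id
    return None


def add_word(words, text):
    """Add `text` unless an exact-case duplicate exists; return the word's ID."""
    existing = find_exact(words, text)
    if existing is not None:
        return existing
    new_id = max((word_id for word_id, _ in words), default=0) + 1
    words.append((new_id, text))
    return new_id


# Example: "Ford" is present, so "ford" is still allowed as a separate word.
table = [(1, "Ford")]
ford_id = add_word(table, "ford")
same_id = add_word(table, "Ford")
```

After these calls the table holds both Ford and ford, and adding Ford a second time returns the existing ID instead of creating a duplicate.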
Many years ago I was writing an HTML editor (which, in fact, I am using to write this document) and wanted to add spell-checking. To do this, I searched the Web for word lists. I found a bunch, but they were all pretty junky. I compiled them, cleaned them up, ran them through other spell checkers, and ended up with a list of about 100,000 words. When I (re)started this A.I. project, I began with that list for the Words table, adding to it when needed and creating entries in the Syllables, Pronunciation, and Cortex POS tables. Not every word in the Words table was linked into the Cortex. Words (in the Words table) without such entries should be considered suspect -- either misspellings or made up; however, they could be obscure, but legitimate words. Unused words were left in the table because they don't take up much room. Common misspellings can be included in the database and linked to the proper spellings with a LinkType of misspelling. A spelling corrector added to the AI-C Lookup program in mid-2010 has proven to be so accurate that it almost always can find the intended word from any normal types of misspellings, so the entries for misspellings are probably not necessary. Spanish words: In the late 1980's, I wrote an English-Spanish dictionary by the name of Ventanas: Spanish For Windows. (I guess that was my attempt at being clever since for you non-Spanish speakers, the word ventanas is literally the Spanish word for windows. And at the time the program was written, many Windows programs were identified as being for Windows since Windows was still relatively new.) As of March 22, 2010, I had not looked at it for a long time, so I tried running it under Windows 7 and the program still runs (in Visual Basic 3). Looking at it again was funny because I had absolutely no recollection of how it was designed, so it was like looking at someone else's program. 
In the late 1990's, I wrote an updated version of the software (using the same basic database of words) that listed categories, synonyms, and more. The programs have some interesting features, such as showing POSs, style, synonyms for both the English and Spanish words, other translations of the word in each language, words/phrases with the word's root, and full conjugations of verbs. But the most interesting feature of all at this time is the ability to export all the data to a text file, which will allow me to import it into AI-C at some point (but not right now). It was also of interest to be able to see where I was with this kind of project almost 20 years ago. The fact that the Ventanas database has a pretty long list of English words in it makes me think that this may have been the actual starting database of words for AI-C, contrary to what I said above, though I have no memory of it. Ah, well. Not having a memory is what keeps things fresh! (In case it's not obvious, I'm kinda old.) If interested, you can download Ventanas by right-clicking on the preceding link and saving to disk. This is the first version. I never finished and circulated the second version, though the program runs. I'll upload it and its later database when I get the chance. I'll also upload all the source code. If you Google, you may still be able to find places where you can download the actual VB3 programming language, though I don't know if you can compile with it under Win7, nor why you would want to do so.
Even large, big-name dictionaries have errors, inaccuracies, ambiguities (see the Random House definition at the end of this document), and inconsistencies in them. In the course of this project, I have found dozens and dozens of basic errors (e.g., typos and circular references) in such sources. This is understandable, as these are very large works assembled by humans, and we all make mistakes. Unlike the keepers of a computer database, dictionary publishers have no automatic way to enforce consistency or to verify accuracy.
The Cortex database may also have errors in it, having been created by humans (giving myself the benefit of the doubt), but with the difference that, unlike printed dictionaries, the database can easily be corrected and, over time, errors winnowed out. I have also written numerous routines which the computer can run to look for and correct some types of errors.
But even if errors get into the Cortex, it doesn't mean that they will be used. Before the Cortex can be used for A.I., software will have to parse documents, wikis, etc., and translate such text into linked concepts in the Cortex. Once the Cortex hits a critical mass, new text will be understandable to AI-C by examining links in the Cortex. If a word or link is incorrect, it will probably never become interlinked with the rest of the Cortex, so it will never be used; or if it is used and is recognized by a human as wrong, it can easily be corrected.
Finally, as mentioned elsewhere, a significant percentage of what people communicate is incorrect -- either wrong words, misspelled words, improper grammar, redundant words, or excluded words. On top of that you can add faulty logic and incorrect "facts", either by accident or on purpose. It is not enough that AI-C can understand proper English (or other language); like humans, it must understand what humans are trying to say.
Entries for major Prefixes and Suffixes are included in the Words list and are given a prefix or suffix POS link entry in the Cortex. I debated doing this, but decided it might prove useful and couldn't really hurt, particularly for prefixes like un- and non-, which can be used with hundreds (if not thousands) of words to make new words. Actually, any kind of bits of text (such as other parts of words) can be stored in the Words table for use by the Cortex.
Verb forms, such as past tense, present participles/gerund, and 3rd-person singular, have been included in the Words table, even though the book Speech and Language Processing, considered by many to be the Bible of NLP, says: the idea of listing every noun and verb [form] can be quite inefficient. While it is true that space could have been saved by using rules for regular word forms instead of entering all the forms, the Words table is very small relative to what the Cortex will ultimately become. Having all noun, adjective, and verb forms in the Words table should simplify (and thus speed up) parsing sentences and finding words, which is far more important than saving a little disk space. Here's an example: What is the present tense word for the past tense word: indebted?