berkus wrote:I gave you a direct link which you refused then to read. Blame me.
I didn't refuse to read it - I glanced through it and didn't recognise it as being that project, and when I came upon parts that I didn't want to be influenced by, I saved it to read some time in the future, once I've had a proper go of my own at designing algorithms to do the same kinds of thing. Once you've got one method of doing something in your head, you're much less likely to find alternative ways which might be more efficient. If I try to find my own ways first and then discover that other people have found better ones, I don't lose anything, because I can simply switch to doing things their way; whereas if I just go with their way from the start, or am influenced by it, I could miss something much better that I would otherwise have found.
berkus wrote:Lets take it to another deeper level then. Care to elaborate how it works in detail?
I don't care to, and I haven't designed this part of the process in detail either, as my current starting point is text, but I will give you an overview of how I would do some of the initial stages. The first step is to extract a string of phonemes from the sound stream, and this string will often include sounds which aren't phonemes at all but have been misidentified as speech sounds. Some phonemes may also be misidentified due to unclear articulation or background noise combining with them, and others may be missing altogether after being masked by background noise. Using stereo sound, or ideally four microphones for precise directionality, would make it easier to eliminate noise coming from other directions, but that isn't possible if you're taking the sound from a mono recording. Anyway, you'll end up with a string of phonemes something like this: enjwj iwlendapwy7astrykqvfonimzsam7yklajc2ys.

The next stage of the process is to look up a phonetic dictionary to try to split the string of phonemes into words. There may be multiple theories as to where some of the boundaries lie, so more than one set of the data may be passed on, and if nothing fits exactly, the best fit must be looked for; we might also return to have another go at this if later stages of analysis suggest we've taken a wrong turn. There's no need to translate to standard spellings, as we're following a different route from the one that would be taken if the machine was taking text as input (which is the route I've actually programmed for), but the process itself is now practically the same and the two routes merge a little further on. The meanings of the words are then checked to see if they fit the context of what came before - this may pick up words which have been misheard, if a similar-sounding word with a more likely meaning exists and would fit in the sentence.

Next we get into an area which checks the way the sentence is constructed, building theories as to how it might hang together. Many words can behave as nouns or verbs, for example, and you have to branch every time you meet one (sometimes several ways at a time), multiplying the amount of data you have to handle to the point where it could quickly overload the memory of the machine if you go about it the wrong way. You have to go with the most likely routes instead and label certain branch points to be looked at later on if the more likely routes break down.

I don't want to describe my way of doing this, nor go through all the steps that follow, other than to say that several theories as to what the sentence means may survive the process, so it's necessary to see which make the most sense and which best fit the context. It may not be possible to work out a clear winner, so the data may need to be stored in a linked form where all the surviving meanings are kept but with different probabilities tied to each, and if one meaning is ruled out later on, that will change the probabilities of the surviving ones. The machine may decide to ask for clarification if it's holding a conversation (and if the information sounds as if it may be important), but that won't be possible if it's just listening to audio files. If the sentence makes so little sense that there must be an error in it, other meanings can be considered and the sounds checked against the original input to see if they fit sufficiently well for a "mishearing" to be the most likely explanation.
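To give a flavour of how the word-splitting stage could work (this is just a toy illustration in Python rather than my own method - the dictionary is made up, and a real version would also have to rank the splits and fall back to best-fit matching when nothing fits exactly):

def segment(phonemes, dictionary, pos=0, words=None, theories=None):
    # Return every way of splitting the phoneme string into dictionary
    # words. Each result is one theory about where the boundaries lie;
    # later stages decide which theories actually make sense.
    if words is None:
        words, theories = [], []
    if pos == len(phonemes):
        theories.append(list(words))
        return theories
    for end in range(pos + 1, len(phonemes) + 1):
        chunk = phonemes[pos:end]
        if chunk in dictionary:
            words.append(dictionary[chunk])
            segment(phonemes, dictionary, end, words, theories)
            words.pop()
    return theories

# Toy phonetic dictionary: phoneme strings mapped to words.
DICT = {"aj": "I", "s": "see", "si": "sea", "i": "E", "it": "it", "t": "tea"}
print(segment("ajsit", DICT))
# -> [['I', 'see', 'E', 'tea'], ['I', 'see', 'it'], ['I', 'sea', 'tea']]

As for keeping the branching under control, the standard trick is something along the lines of a beam search: keep only the few most likely analyses live, and shelve the rest rather than throwing them away, so they can be revisited if the favoured routes break down. Again, a crude sketch with made-up word classes and scores, not a description of how my own code does it:

def analyse(words, classes, scores, beam_width=2):
    # Expand word-class theories one word at a time. Every ambiguous word
    # multiplies the number of theories, so only the beam_width most likely
    # ones stay live; the rest are shelved so they can be picked up again
    # later if the live ones stop making sense.
    live = [(1.0, [])]                        # (probability, classes so far)
    shelved = []
    for word in words:
        expanded = []
        for prob, path in live:
            prev = path[-1] if path else None
            for cls in classes[word]:
                p = prob * scores.get((prev, cls), 0.01)
                expanded.append((p, path + [cls]))
        expanded.sort(key=lambda t: -t[0])
        live, shelved = expanded[:beam_width], shelved + expanded[beam_width:]
    return live, shelved

classes = {"the": ["det"], "ship": ["noun", "verb"], "sails": ["verb", "noun"]}
scores = {(None, "det"): 0.9, ("det", "noun"): 0.8, ("det", "verb"): 0.05,
          ("noun", "verb"): 0.7, ("noun", "noun"): 0.2,
          ("verb", "noun"): 0.4, ("verb", "verb"): 0.05}
live, shelved = analyse(["the", "ship", "sails"], classes, scores)
# live holds the two most probable readings; shelved holds the branch
# points that were set aside and could be returned to later.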
Once the best theory as to what the information is has been determined, it's rated for how likely it is to be true. That rating is based partly on who it came from - a code representing the source is stored with it, so if the reliability rating of the source changes, the reliability of the data changes automatically, and this may lead to a cascade of other changes to probabilities throughout the database, depending on the significance of the source and the data - and partly on how well it fits in with existing knowledge in the database. Most words and names are stored as 16-bit codes, though less common ones take prefixes, and large parts of sentences which are really just identifiers, such as "the film we saw yesterday", can be converted into a single code which simply represents the film in question if it can be identified from that description (though the fact that "we saw that film yesterday" may also be new data to the machine that's worth storing). Overall this makes the stored data much less bulky than the original text or phonetic string. To avoid overloading the database, information also has to be rated for its importance, so that if space becomes tight the less important stuff can be junked.
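To give a rough idea of what I mean by the coding and the source ratings - and this is purely illustrative, with every number, name and layout invented for the example rather than taken from my actual database:

import struct

COMMON = {"the": 1, "film": 2, "we": 3, "saw": 4, "yesterday": 5}  # 16-bit codes
RARE_PREFIX = 0xFFFF  # reserved code: the next field holds a rarer word's index

def encode(words, rare_table):
    # Pack common words as single 16-bit codes; rarer ones get a prefix
    # code followed by a wider index into a secondary table.
    out = []
    for w in words:
        if w in COMMON:
            out.append(struct.pack("<H", COMMON[w]))
        else:
            idx = rare_table.setdefault(w, len(rare_table))
            out.append(struct.pack("<HI", RARE_PREFIX, idx))
    return b"".join(out)

# Each stored fact carries only a code for its source, so re-rating the
# source re-rates everything that came from it without touching the facts.
sources = {7: 0.9, 8: 0.9}                        # source code -> reliability
facts = [("fact A", 7, 0.8), ("fact B", 8, 0.8)]  # (data, source code, fit with existing knowledge)

def believability(fact):
    data, src, fit = fact
    return sources[src] * fit

sources[8] = 0.1  # this source turns out to be unreliable...
# ...and every fact tied to source 8 is automatically downgraded the next
# time believability() is worked out - nothing stored per fact has to change.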
I've gone into more than enough detail there for anyone else who's doing this kind of work to know that I really am doing it, so that's all you're going to get from me for the time being.