In the early days of High-Performance Computing (HPC), solving complex problems was driven by how much processing power was available to achieve the goal within a reasonable amount of time. At the same time, the algorithms that made up HPC applications were designed to take full advantage of the available processing capabilities, with no limits in sight. Moore’s law supported this idea with impressive innovations and ever more powerful CPUs that kept up with demand and kept compute-intensive applications happy.
It is fair to say that most companies considered the source code for their algorithms the core Intellectual Property (IP) of their business and protected it accordingly. Patenting an algorithm was an effective way to prohibit somebody else from using the same or a similar methodology. The algorithm was the differentiator between them and the competition. The combination of the algorithm with an HPC environment gave companies a working model that generated repeatable results with a predictable growth path. However, it is very difficult to build past behavior into rule-based algorithms without the capability to analyze and learn from the past in detail, and this is one of the main problems of a compute-centric approach.
The open-source community has been the major technology driver behind the Big Data push, addressing the need for a more data-centric approach. That is, the ability to create a working model that can predict future outcomes based on past events (data) and that can be continuously improved as new data is brought into the process. Big Data represents the volume of data being generated daily, the speed (velocity) at which that data arrives, the variety of data formats, such as structured or unstructured, and the variation in data quality.
Artificial Intelligence (AI) builds on HPC for large-scale compute processing and on Big Data and its open-source ecosystem.
The real IP now is the data, which has become the main differentiator between competitors. The algorithms used in AI are implemented in software frameworks created and shared by many people. This sharing of ideas and concepts is a key component of the growing success of AI.
With the democratization of data, it is now possible for non-data-scientists to collect and analyze data with little help and without a science degree. There are numerous guides and tutorials to get you started with AI. That doesn’t guarantee success, but it brings AI within greater reach and drives the move towards self-service AI. The data is front and center and will define any success or failure, and deep knowledge and understanding of the data comes with great responsibility.
Not all data is equal. It is best to start an AI training project with a collection of data that resembles the target data domain as closely as possible. In the end, an AI engine will analyze your data and suggest decisions based on the given training data. If the data quality is low, the AI engine will make inaccurate predictions. That doesn’t mean that everything must be perfect, but take precautions to make sure that the percentage of bad data is significantly lower than that of good data. Let’s say, for example, that we want to create a model to recognize different kinds of cats in pictures and that we have unwittingly been given a mixed data set of pictures of cats (50%) and dogs (50%). The resulting model will pick up features from both dogs and cats, impacting its accuracy for cats. On the other hand, a small percentage of dog pictures will not have a major impact on the accuracy.
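One simple precaution against the cats-and-dogs problem above is to inspect the label proportions of a data set before training. The following is a minimal sketch; the label values and the `class_proportions` helper are hypothetical, not part of any specific AI framework:

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of examples carrying each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for a small image set: a 50/50 cat/dog mix like
# this one would be flagged before training a cat-recognition model.
labels = ["cat"] * 50 + ["dog"] * 50
props = class_proportions(labels)
print(props)  # {'cat': 0.5, 'dog': 0.5}
```

A quick check like this costs almost nothing and catches exactly the kind of half-contaminated data set described above before it distorts the model.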
At the start of a new project the initial data is divided into training and test data. The training data is used by the AI engine to analyze and generate a model that is a statistical representation of all given training data. The test data is used to test and validate the model created from the training data. The split is typically close to an 80/20 ratio of training to test data.
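The 80/20 split can be sketched in a few lines of plain Python. This is an illustrative implementation, not taken from any particular toolkit; the function name and the fixed seed are assumptions for reproducibility:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data, then split it into training and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))          # stand-in for 100 data records
train, test = train_test_split(samples)
print(len(train), len(test))        # 80 20
```

Shuffling before splitting matters: if the data is ordered (by date, by source), taking the first 80% would give the model a biased view of the domain.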
Before data can be used for any type of analysis and deliver insights, it needs to be “cleaned”. Cleaning is the concept of making sure that there are no inconsistencies in the data, as these can significantly impact the analysis and results. Some of the potential sources of bad data are human error, mismatches when assimilating multiple sources, and missing or duplicate data.
For example, when collecting dates of past earthquakes, cleaning includes validating that each date is a valid date and occurred in the past. Depending on the processing rules, the date can be corrected, or the whole row of data fields associated with the invalid date can be thrown out. After the cleaning comes the “formatting” and labeling of the data so that it is in the right format for the AI tools. This requires significant human interaction and time to go through all the data, certainly for unstructured data. The good news is that new tools are coming to market that can help automate the process and reduce human error. These AI-driven tools analyze the structure of the data and then predict the kinds of errors to expect in it. Because of its importance, a lot of time in a typical AI lifecycle is spent on this activity, and it is continuous, as new data must go through the same process.
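The earthquake-date example above can be sketched as a small cleaning pass that follows the stricter rule described: rows with an invalid or future date are dropped whole. The row layout and field names here are hypothetical, assuming ISO-formatted date strings:

```python
from datetime import date

def clean_rows(rows, today):
    """Keep only rows whose 'date' field parses as an ISO date in the past."""
    cleaned = []
    for row in rows:
        try:
            d = date.fromisoformat(row["date"])
        except (KeyError, ValueError):
            continue  # missing or malformed date: drop the whole row
        if d >= today:
            continue  # a future date cannot be a past earthquake
        cleaned.append(row)
    return cleaned

rows = [
    {"date": "2011-03-11", "magnitude": 9.1},  # valid, in the past
    {"date": "2011-13-01", "magnitude": 5.0},  # month 13: malformed
    {"date": "2999-01-01", "magnitude": 4.2},  # in the future: invalid
]
kept = clean_rows(rows, today=date(2024, 1, 1))
print(len(kept))  # 1
```

A gentler policy could instead try to correct suspect dates (for instance, swapped day and month fields); which rule applies is a per-project processing decision, as the text notes.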
It is all about the data! We live in a data-centric world where data has become extremely valuable. The ability to extract value out of that data with AI has opened many new opportunities (most of them with good intentions). With AI it is important not to fixate on getting as close as possible to 99.9% accuracy, but rather to focus on improving the overall process and acquiring quality data. It all starts with the data: the algorithm can’t fix bad data; at best it can try to minimize its impact. The data comes with a lot of responsibility, and with the promise of providing deep knowledge and understanding.