Why online data collection is key to the future of responsible AI development
Teaching AI systems properly can be impossible without following the proper data collection protocols
September 2, 2020
Many organizations see AI initiatives as an integral part of their future, and research by Deloitte even shows that 73% of IT and line-of-business executives see AI as an indispensable part of their current business.
But AI systems can only ever be as powerful as the information that they are built on, and huge quantities of very specific data are needed to effectively train these systems.
Where does this information come from?
Often it comes from the largest source of information that has ever existed – publicly available online data. Social media data, to give just one example, is being utilized by organizations as a source of information about consumer sentiment and behavior. This data is being used to develop AI systems by businesses in industries as varied as insurance, market research, consumer finance, and real estate to gain an edge over their competition.
In these instances, information such as Twitter posts and online review data is leveraged to develop the AI insights needed to stay afloat in a volatile business environment. For example, hiring announcements on Twitter or on job websites for positions in the automotive industry could indicate an economic rebound in that sector, or that the industry itself anticipates an uptick in demand.
But collecting data at this mammoth scale is not without its challenges. Organizations gathering data are often blocked, by competitors or for other reasons, or they encounter difficulties collecting data in every region they are looking to target globally. Perhaps just as challenging is the need to balance these goals against the obligation to treat consumer data with the respect it deserves and to comply with data protection legislation such as GDPR. If businesses want to develop the AI systems they need to stay competitive, and to do so responsibly, taking the right approach to online data collection is a non-negotiable requirement.
Why the data collection methodology is all-important
Teaching AI systems properly can be impossible without following the proper data collection protocols, because only “clean,” accurate data can create the right level of ROI for businesses. Websites often block requests that appear to come from data centers, or feed them incorrect information, because businesses want to stop competitors from scraping their data to gain a competitive advantage. Using a robust network of residential IPs solves this problem, as your data collection appears indistinguishable from the activity of real consumers and yields the same data points that real users' activity would.
Developers can use IP proxy networks to view online information exactly as it appears to everyday consumers. This gives businesses a realistic and transparent view of the internet in every region, which is vital because products are often priced dynamically: a consumer in California may be charged a different price for an online product than a consumer in Poland. If organizations' data sets fail to reflect these subtle distinctions, then it’s impossible for them to produce the value-creating AI models they need to prosper. In fact, research from Cognilytica indicates that corporations spend over 80% of their project time on cleaning data in preparation for AI usage, demonstrating that employing the right data collection methodology pays.
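As a rough illustration of checking whether a data set reflects these regional distinctions, the Python sketch below looks for products that lack a price observation in a target region and measures how far prices diverge across regions. The record layout, the function names, and the sample values are all hypothetical placeholders chosen for this example, not real data or any particular platform's format.

```python
from collections import defaultdict

# Hypothetical price observations: (product, region, price in USD).
# These values are illustrative placeholders, not real data.
observations = [
    ("headphones", "California", 99.0),
    ("headphones", "Poland", 89.0),
    ("backpack", "California", 45.0),
]


def regional_gaps(rows, required_regions):
    """Return, per product, the required regions that have no observation."""
    seen = defaultdict(set)
    for product, region, _price in rows:
        seen[product].add(region)
    required = set(required_regions)
    return {
        product: sorted(required - regions)
        for product, regions in seen.items()
        if required - regions
    }


def price_spread(rows):
    """Return, per product, the highest observed price minus the lowest."""
    prices = defaultdict(list)
    for product, _region, price in rows:
        prices[product].append(price)
    return {product: max(values) - min(values) for product, values in prices.items()}


gaps = regional_gaps(observations, ["California", "Poland"])
spread = price_spread(observations)
print(gaps)
print(spread)
```

A product that appears in the gaps result has incomplete regional coverage, so a spread of zero for that product says nothing about whether its price actually varies by region.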
How businesses can collect better, cleaner data
For AI and machine learning to generate meaningful return on investment (ROI) for businesses, clean data sourcing must sit at the very top of the priority list and stay there indefinitely. Businesses need to plan carefully if they want to put this into practice. The first step is defining what they ultimately want to accomplish with their AI endeavors, such as predicting the future prices of real estate in an area. Businesses then need to define the data most needed to fulfill this goal. In the real estate case, that might be Google Maps data showing the number of new businesses opening in the area, which can act as a rough indicator of economic growth in the region.
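To make the real estate example concrete, here is a minimal Python sketch that turns a list of business opening dates into a yearly count and a year-over-year change. The dates are invented for illustration, the function names are hypothetical, and the sketch assumes the opening dates have already been collected. It does not show how to obtain them from any mapping service.

```python
from collections import Counter
from datetime import date

# Hypothetical opening dates for new businesses in one area (illustrative only).
openings = [
    date(2019, 3, 1),
    date(2019, 9, 15),
    date(2020, 1, 10),
    date(2020, 4, 2),
    date(2020, 6, 30),
]


def openings_per_year(dates):
    """Count openings per calendar year, ordered by year."""
    return dict(sorted(Counter(d.year for d in dates).items()))


def year_over_year_change(counts):
    """Fractional change between each year and the previous year present in the data."""
    years = sorted(counts)
    return {
        current: (counts[current] - counts[previous]) / counts[previous]
        for previous, current in zip(years, years[1:])
    }


counts = openings_per_year(openings)
change = year_over_year_change(counts)
print(counts)
print(change)
```

A signal like this is only as reliable as the underlying collection: a year that is missing from the data set is skipped entirely rather than counted as zero, which is one way that gaps in collection can quietly distort a model's inputs.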
Lastly, businesses need to adopt a data collection platform that can consistently feed them the data they need. It will need to be a global network, with capacity to handle gargantuan data volumes, that incorporates consumer devices in every location.
How better data makes for better decisions
Building an AI system is like building a house. You can have the best architect or the best team of builders on the planet, but if the raw materials are flawed, are the wrong type, or are simply in short supply, there are going to be serious issues with the final product. If you build on a foundation of clean and accurate online data sources, you will have a robust base on which to build powerful AI systems. These systems will be able to provide powerful, dependable, and accurate business insights despite the unprecedented volatility in market trends.
The path to correct decision making is more fraught with potential for error than ever before, with more and more business decisions being made firmly on the basis of AI-derived insights. Collecting the best data possible represents a viable shortcut to making the best possible decisions.
Omri Orgad is Managing Director for North America at Luminati Networks, the world's largest proxy service for business.