So what’s all this talk about annotation?

by AI Business
by Charly Walther, Lionbridge

If there’s one thing that's plentiful in artificial intelligence (AI), it’s buzzwords. One of those is “annotation” -- a crucial, data-related term we’d like to break down for readers here.

So what is it?

Annotation is the process of tagging data of all forms, including text, audio, and images. By training on this tagged data, machine learning algorithms learn to recognize patterns. That’s why annotation is so essential. If AI developers import data that hasn’t been properly tagged, the AI won’t learn correctly.

How is annotation done?

  1. Semantic annotation

Semantic annotation is a form of tagging that’s used in natural language processing (NLP) to ascribe meaning to words so that machine learning algorithms can read provided data. Use cases include search engine relevance improvements and chatbot training.

  2. Image and video annotation

Image recognition and processing are used to keep automatic recognition systems secure, to build self-driving cars, to classify e-commerce product lists, and to perform other tasks. These types of AI may sound dissimilar, but one thing they all have in common is the need to understand the context behind images and videos in order to work. And this takes data.

Image and video annotation begins with a bounding box – a rectangle or square annotators draw on top of the picture. Annotators also add tags so the machine learning model can recognize what’s in the box and learn when it’s different from objects in other boxes.
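To make the idea concrete, here is a minimal sketch of how bounding-box annotations might be stored. The `BoundingBox` class, its field names, the pixel coordinates, and the labels are all assumptions made for illustration; real annotation tools use their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One annotated rectangle; coordinates are pixels from the image's top-left corner."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Clamp at zero so a malformed box never reports a negative area.
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


def overlap_area(a: BoundingBox, b: BoundingBox) -> int:
    """Area shared by two boxes, or 0 if they do not intersect."""
    width = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    height = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(0, width) * max(0, height)


# Hypothetical annotations for a single street photo.
annotations = [
    BoundingBox("car", 40, 120, 260, 240),
    BoundingBox("pedestrian", 230, 100, 290, 250),
]

for box in annotations:
    print(box.label, box.area())
print("overlap:", overlap_area(annotations[0], annotations[1]))
```

Each box pairs a region of the image with a tag, which is what lets a model learn both what is inside a box and how it differs from the contents of other boxes.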

  3. Text classification

Don’t let the word “classification” fool you. Text classification is just another form of annotation. Here, data engineers assign predefined categories to written material. Depending on the project, they might tag sentences or paragraphs by topic – such as national, international, sports, or entertainment, for AI that works with news.
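As a rough illustration, labeled text for a news project might look like the sketch below. The sentences are invented, and the helper name `find_invalid_labels` is an assumption; the categories are the ones named above.

```python
from collections import Counter

# The predefined categories annotators are allowed to assign.
CATEGORIES = {"national", "international", "sports", "entertainment"}

# Hypothetical annotated records: (text, category).
labeled_sentences = [
    ("Parliament passed the new budget on Tuesday.", "national"),
    ("Trade talks between the two countries resumed.", "international"),
    ("The home team won the final in extra time.", "sports"),
    ("The film festival opens this weekend.", "entertainment"),
    ("The striker signed a three-year contract.", "sports"),
]


def find_invalid_labels(records, categories):
    """Return the records whose label is not one of the predefined categories."""
    return [(text, label) for text, label in records if label not in categories]


# Count how many examples each category has; large imbalances are worth noticing.
label_counts = Counter(label for _, label in labeled_sentences)

print(find_invalid_labels(labeled_sentences, CATEGORIES))
print(label_counts)
```

Checking labels against the predefined set is a simple way to catch typos or off-list categories before the data reaches a model.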

  4. Entity annotation

Another popular buzzword, “entity” is the AI term for a collection of objects that all fall in the same category when building data, such as people, things, or places. Accordingly, entity annotation is the process of tagging unstructured text with this category info so the AI can process data connected to that entity.

Entity annotation can happen in different ways. Many solutions have more than one type of annotation built in, allowing data scientists to better manipulate data as needed. One of these is phrase chunking, which builds on part-of-speech tagging: each word is tagged with its appropriate classification, like noun, verb, or adjective, and the tagged words are then grouped into phrases such as noun phrases or verb phrases. Some machine learning datasets require this work for every word.

Then there’s intent extraction, a form of entity annotation commonly used in building chatbots. With this technique, it’s important for algorithms to accurately determine what users really mean (their “intent”) when they ask the bot a question. Intent extraction tags data on the phrase or sentence level, building a library of expressions that the algorithm can use later to understand new sentences.
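The sketch below shows one way such a library of tagged expressions could be used. The intent names, the example phrases, and the word-overlap scoring in `guess_intent` are all assumptions chosen to keep the example small; a production chatbot would typically train a statistical model on the annotated phrases rather than count shared words.

```python
import re

# Hypothetical library: each intent maps to phrases annotators tagged with it.
intent_library = {
    "check_order_status": ["where is my order", "track my package"],
    "cancel_order": ["cancel my order", "i want to cancel"],
}


def tokens(text):
    """Lowercase a sentence and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def guess_intent(sentence):
    """Return the intent whose tagged phrase shares the most words with the sentence.

    Returns None when no tagged phrase shares any word with the sentence.
    """
    words = tokens(sentence)
    best_intent, best_score = None, 0
    for intent, phrases in intent_library.items():
        for phrase in phrases:
            score = len(words & tokens(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent


print(guess_intent("Can you track my package please?"))
```

Even in this toy form, the quality of the result depends entirely on how consistently the phrases were tagged, which is the point of careful intent annotation.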

Another form of entity annotation, entity linking, is the process of linking related words together. For example, a company name might be associated with a certain employee name, a person with a place of residence, and so on.

Named entity recognition is a form of entity annotation that tags words and phrases with the category they belong to, such as person, organization, or location.
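One common way to record this kind of annotation is as character spans over the original text. The following is a minimal sketch; the `EntitySpan` class, the `annotate` helper, and the `PERSON` and `ORGANIZATION` labels are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A labeled stretch of text, stored as [start, end) character offsets."""
    start: int
    end: int
    label: str


def annotate(text, phrase, label):
    """Tag the first occurrence of `phrase` in `text` with `label`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return EntitySpan(start, start + len(phrase), label)


text = "Charly Walther is VP of product and growth at Lionbridge."

spans = [
    annotate(text, "Charly Walther", "PERSON"),
    annotate(text, "Lionbridge", "ORGANIZATION"),
]

for span in spans:
    print(text[span.start:span.end], "->", span.label)
```

Storing offsets rather than copies of the words keeps each tag tied to an exact position, so the same word can carry different labels in different places.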

Although it may seem simple, annotation work is actually quite intricate. Content and complexity levels vary greatly from project to project, and the work sometimes extends into multiple categories. To keep annotation time under control, AI engineers often rely on third-party companies that specialize in the task. Crowdsourcing services can also help tag text, image, audio, and other data sets. The clearer the instructions these services receive, the larger a project’s return on investment, which makes selecting the right partner key not just to timeline and cost but to every part of AI development.


Charly Walther is VP of product and growth at Lionbridge, a company specializing in machine translation and language data services.
