Pipelines
The biggest driver of cost, quality, and speed
Diagram of a typical human data pipeline at micro1
Behind the scenes, a project is almost never as simple as distributing tasks equally among annotators, having them complete the tasks, and delivering the results back to the client. For tasks to be useful as training data, every one of them must meet the project’s quality standards. Depending on the type and complexity of a task, it may also be better to get multiple opinions on it. You can think of a task in a pipeline as a product on an assembly line: some projects have multiple stations where data is added to the task, while others have only one.
As discussed in the previous section, we would route a task to an individual pod made up of 10 annotators, 5 reviewers, and a Pod Lead. An annotator would attempt the task and then rate its difficulty. The difficulty can stem from the inherent request of the task. For instance, imagine having to verify that an AI-generated essay cites a source for every statement and that the cited sources actually back up the essay’s claims. Depending on the length of the essay or the number of citations, an annotator could mark the task as Easy, Medium, or Hard.
A Medium rating would prompt the task to be completed by an additional annotator, after which a reviewer would look at both annotators’ work, fact-check it, and try to merge the two results. By having more human eyes on the task, we are able to minimize the occurrence of mistakes. If a task is rated Hard, not only would the aforementioned happen, but the task would also be reviewed by the Pod Lead, who would previously have been a senior reviewer in this specialization. An AI sanity check would then be run, with any flagged tasks sent to the Pod Lead. Finally, the Pod Lead would be responsible for signing off each task.
This pipeline introduces many opportunities to catch mistakes, but it is not the lowest-cost pipeline possible. The design of your project’s pipeline will directly affect the quality, cost, and speed of your project, and when a client’s requirements are urgent it is not unheard of to have available Pod Leads work directly on a task from start to finish.
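The difficulty-based routing described above can be sketched in a few lines of Python. This is only an illustration, not micro1's actual tooling: the names `Difficulty`, `plan_stages`, and `human_touches`, and the stage labels, are hypothetical. It also assumes that an Easy task skips the second annotator and the reviewer, which the description implies but does not state outright.

```python
from enum import Enum


class Difficulty(Enum):
    EASY = "Easy"
    MEDIUM = "Medium"
    HARD = "Hard"


def plan_stages(difficulty: Difficulty) -> list[str]:
    """Return the ordered stations a task visits, given the first annotator's rating."""
    stages = ["annotator_attempt"]

    # Medium and Hard tasks get a second annotator, then a reviewer
    # who fact-checks and merges both results.
    if difficulty in (Difficulty.MEDIUM, Difficulty.HARD):
        stages += ["second_annotator_attempt", "reviewer_merge"]

    # Hard tasks are additionally reviewed by the Pod Lead.
    if difficulty is Difficulty.HARD:
        stages.append("pod_lead_review")

    # Every task ends with the AI sanity check and the Pod Lead sign-off.
    # Assumption: Easy tasks go straight here after the first attempt.
    stages += ["ai_sanity_check", "pod_lead_sign_off"]
    return stages


def human_touches(stages: list[str]) -> int:
    """Count the stations handled by a person (everything except the AI check)."""
    return sum(1 for stage in stages if stage != "ai_sanity_check")


if __name__ == "__main__":
    for level in Difficulty:
        stages = plan_stages(level)
        print(f"{level.value}: {human_touches(stages)} human touches -> {stages}")
```

Counting human touches per rating gives a rough feel for the trade-off: under these assumptions, each step up in difficulty adds more people to the task, which is where the extra quality, cost, and time come from.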