Big data? Small data? Metadata in a Data lake? I know there is some kind of joke there. Too bad we insist on writing everything in English…
Language and vocabulary can be difficult. Especially when it comes to the world of data science.
Therefore, we have described 50 of the most common terms in this glossary. Hopefully, this will help you in your journey to become data driven.
Tip: Press Ctrl + F (or Cmd + F on a Mac) to search for specific words or terms.
Data Driven Glossary
3 Times Understanding
A structured process focusing on data to gain insights and knowledge about the overall needs from three different perspectives: business, know-how, and technology.
5P is the process we use to identify what needs to be in place for building your data pipeline. It stands for People, Processes, Pipelines, Platforms, and Partners.
API is the acronym for Application Programming Interface, which is a software intermediary that allows two applications to talk to each other.
The API economy refers to the way application programming interfaces (APIs) can positively affect a company’s profitability. APIs enable a business either to scale quickly by accessing third-party data and services, or to turn its own services and data into a platform that attracts partners to build upon it and brings in new customers in the process.
Access to data is critical for the success of your business. Easily accessible data enables you to move quickly, focus on the product, and build a data-informed culture where data leads to better decisions and action.
Data adoption is a process through which businesses find innovative ways to enhance productivity and predict risk to satisfy customers’ needs more efficiently.
The collection of data from multiple sources, bringing all the data together into a common repository for reporting and/or analysis.
An algorithm is a set of well-defined instructions in sequence to solve a problem.
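As an illustration of "well-defined instructions in sequence", here is a minimal sketch of one classic algorithm, Euclid's method for finding the greatest common divisor of two whole numbers. The function name `gcd` is chosen for this example only.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor using Euclid's algorithm."""
    # Step 1: as long as b is not zero, replace (a, b) with (b, a mod b).
    while b != 0:
        a, b = b, a % b
    # Step 2: when b reaches zero, a holds the answer.
    return a


print(gcd(48, 18))  # prints 6
```

Each step is unambiguous and the loop always terminates, which is what makes this an algorithm rather than a loose description.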
A data analyst is responsible for collecting, processing, and performing statistical analysis of data. A data analyst discovers how this data can be used to help the organization make better business decisions, and works with business end users to define the types of analytical reports the business requires. It is also one of the roles that define a career in big data.
Data analytics is the science of analyzing raw data to draw conclusions about that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that process raw data for human consumption.
Data Cleansing/Scrubbing/Cleaning is the process of revising data to remove incorrect spellings and duplicate entries, add missing data, and provide consistency. It is required because incorrect data can lead to bad analysis and wrong conclusions.
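The cleaning steps named above (spelling fixes, duplicate removal, filling in missing values, consistent formatting) can be sketched in a few lines of Python. The records, the `spelling_fixes` table, and the `"Unknown"` default are made up for illustration; real cleaning rules depend on the data set.

```python
def clean_records(records, spelling_fixes, default_country="Unknown"):
    """Return a cleaned copy of the records (illustrative rules only)."""
    cleaned = []
    seen = set()
    for record in records:
        # Consistency: trim whitespace and normalize capitalization.
        name = record["name"].strip().title()
        city = record["city"].strip().title()
        # Incorrect spellings: apply a lookup table of known corrections.
        city = spelling_fixes.get(city, city)
        # Missing data: fall back to a default value.
        country = record.get("country") or default_country
        # Duplicates: skip records already seen after normalization.
        key = (name, city)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "country": country})
    return cleaned


raw = [
    {"name": "anna berg ", "city": "Gothenburg", "country": "Sweden"},
    {"name": "Anna Berg", "city": "Gotenburg", "country": "Sweden"},
    {"name": "Lars Holm", "city": "Oslo", "country": None},
]
result = clean_records(raw, spelling_fixes={"Gotenburg": "Gothenburg"})
for row in result:
    print(row)
```

The second record only becomes a recognizable duplicate after the spelling fix, which is why the order of the cleaning steps matters.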
When data is communicated, whether it shows good or bad results.
Decisions being made based on data.
The process of democratizing data means making data accessible to as many people as possible within a company. Decisions can then be made using data that’s tangible, easily understood, and business-focused. Data democratization happens by sharing data in the right formats and channels, according to each user’s profile and level of knowledge.
A data driven organization is highly committed to gathering data on all aspects of the business. When employees at every level can use the right data at the right time, data fosters conclusive decision-making and becomes part of the company’s competitive advantage. When a company employs a data driven approach, it makes strategic decisions based on data analysis and interpretation.
Combining data from multiple separate business systems into a single unified view, often called a single view of the truth. This unified view is typically stored in a central data repository known as a data warehouse.
The organization knows what data they have access to, what they want to do with it, and they also have a process for going from question to action using their data.
A data lab is a designated data science system that is intended to uncover all that your data has to offer. As a space that facilitates data science and accelerates data experimentation, data labs uncover which questions businesses should ask, then help to find the answer.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
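The idea of a flat store, where every raw element gets a unique identifier and a set of metadata tags, can be sketched as follows. `TinyDataLake` and its methods are hypothetical names for this example and do not correspond to any real data lake product.

```python
import uuid


class TinyDataLake:
    """A toy in-memory 'lake': flat storage, unique IDs, metadata tags."""

    def __init__(self):
        self._objects = {}  # flat mapping: object ID -> raw data and tags

    def put(self, raw_bytes, **tags):
        # Store the data untouched, in its native format.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": raw_bytes, "tags": tags}
        return object_id

    def query(self, **wanted):
        # Return the IDs of all objects whose tags match every requested value.
        return [
            object_id
            for object_id, entry in self._objects.items()
            if all(entry["tags"].get(k) == v for k, v in wanted.items())
        ]


lake = TinyDataLake()
pos_id = lake.put(b"2024-01-05;store-7;129.50", source="pos", format="csv")
erp_id = lake.put(b'{"order": 17, "qty": 3}', source="erp", format="json")

print(lake.query(source="pos") == [pos_id])  # prints True
```

Nothing is transformed when it is stored; the tags are what make the smaller, relevant subset findable later.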
People in the organization can read, understand, and communicate data.
Data mentoring is using an advisor to educate your organization in your first steps towards becoming data-driven. It adds value through increased market insight, holistic data analysis, and practical knowledge.
Data mining refers to techniques for deep data exploration. Data mining is done to extract relevant conclusions that enable more accurate business and/or strategic decisions.
An abstract model that organizes elements of data and standardizes how they relate to one another and the properties of real-world entities.
Data modeling is the process of creating a data model for an information system by using certain formal techniques. Data modeling is used to define and analyze the requirement of data for supporting business processes.
A data pipeline aggregates, organizes, and moves data to a destination for storage, insights, and analysis. Modern data pipeline systems automate the ETL (extract, transform, load) process. They include data ingestion, processing, filtering, transformation, and movement across any cloud architecture, and they add layers of resiliency against failure.
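The ETL steps can be sketched with nothing but the Python standard library. This is a minimal illustration under simple assumptions: the CSV text, the `sales` table, and the function names are invented for the example, and an in-memory SQLite database stands in for the destination.

```python
import csv
import io
import sqlite3

RAW_CSV = """store,amount
north,100.50
south,not-a-number
north,49.50
"""


def extract(text):
    # Extract: read rows from the source (here, CSV text).
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    # Transform: convert amounts to numbers and filter out invalid rows.
    clean = []
    for row in rows:
        try:
            clean.append((row["store"], float(row["amount"])))
        except ValueError:
            continue
    return clean


def load(rows, connection):
    # Load: write the cleaned rows to the destination table.
    connection.execute("CREATE TABLE IF NOT EXISTS sales (store TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)

total = connection.execute(
    "SELECT SUM(amount) FROM sales WHERE store = 'north'"
).fetchone()[0]
print(total)  # prints 150.0
```

A production pipeline would add scheduling, monitoring, and retry logic around these three steps.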
A data platform combines all of the data from various data sets and acts as a centralized hub where it can be accessed for analysis and integrations. A data platform for companies in the food industry collects data from multiple systems (ERP, POS, open data, data warehouse, and much more), harmonizes it into usable and uniform structures, and provides managed APIs and applications to access the data.
Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is “fit for its intended use in operations, decision making, and planning”.
A data scientist is a person proficient in mathematics, statistics, computer science, and/or data visualization who builds data models and algorithms to solve complex problems.
A data strategy is a vision for how a company will collect, store, manage, share, and use data.
Data visualization is the presentation of data in a graphical or pictorial format designed to communicate information or derive meaning. It allows users and decision-makers to see analyses visually, which makes new concepts easier to understand. Data visualization helps
• to derive insight and meaning from the data
• to communicate data and information in a more effective manner
A data warehouse is a system for storing data for analysis and reporting. It is often regarded as a core component of business intelligence. Data stored in the warehouse is uploaded from operational systems such as sales or marketing.
Data scraping, or web scraping, is an automated technique for gathering (i.e. copying) data from the web using a scraper. The scraper is set to extract specific data from targeted websites. Once it has extracted the data, the scraper parses it and stores it in a spreadsheet or database in a readable format.
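The parsing step of a scraper can be sketched with Python's built-in `html.parser` module. To keep the example self-contained, it parses a small hard-coded page instead of fetching one from the web; the page content and the `PriceScraper` class are made up for illustration.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


PAGE = (
    '<ul><li>Milk <span class="price">12.90</span></li>'
    '<li>Bread <span class="price">24.50</span></li></ul>'
)

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.prices)  # prints [12.9, 24.5]
```

Before scraping a real site, check its terms of use and robots.txt, since not every site permits automated collection.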
Decision intelligence is a practical domain that includes a wide range of decision-making techniques. It brings both traditional and advanced disciplines together to design, model, align, execute, monitor, and adjust decision models and processes. The disciplines include decision management (including advanced nondeterministic techniques such as agent-based systems) and decision support, as well as techniques such as descriptive, diagnostic, and predictive analytics.
A visual representation of a process that shows how data and knowledge are merged to make a particular business decision.
Linked data refers to a collection of interconnected datasets that can be shared or published on the web and used by both machines and people. It is highly structured and is used in building the Semantic Web, in which large amounts of data are available on the web in standard formats.
Location analytics is the process of gaining insights from the geographic component, or location, of business data. It presents the analysis and interpretation of the data visually and allows the user to connect location-related information with the dataset.
Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a part of artificial intelligence. It applies statistical strategies and methods for using data to “train” computers to detect and “learn” rules for solving a task, without the computers being programmed with rules for that task ahead of time. Machine learning is used to exploit the opportunities hidden in big data.
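A very small example of "learning a rule from data" is fitting a straight line with least squares. The program below is never told the rule y = 2x + 1; it estimates the slope and intercept from example points. The data and the `fit_line` function are invented for this sketch, and real machine learning models are usually far more complex.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training data: the examples the program learns from.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# Use the learned rule on an input that was not in the training data.
prediction = slope * 10 + intercept
print(slope, intercept, prediction)  # prints 2.0 1.0 21.0
```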
Metadata is data about data. It is administrative, descriptive, and structural data that identifies the assets.
Network analysis is the application of graph theory to categorize, understand, and view relationships between the nodes of a network. It is an effective way to analyze connections and their properties in fields such as prediction, marketing analysis, and healthcare.
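One of the simplest network measures is the degree of a node, meaning the number of connections it has. The sketch below counts degrees in a small made-up network of people; the names and the `degrees` function are illustrative only.

```python
from collections import defaultdict

# Each pair is a connection (an edge) between two nodes.
edges = [("Anna", "Lars"), ("Anna", "Mia"), ("Anna", "Omar"), ("Lars", "Mia")]


def degrees(edge_list):
    """Count how many connections each node has."""
    counts = defaultdict(int)
    for a, b in edge_list:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


degree = degrees(edges)
most_connected = max(degree, key=degree.get)
print(degree)          # prints {'Anna': 3, 'Lars': 2, 'Mia': 2, 'Omar': 1}
print(most_connected)  # prints Anna
```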
Open data is data anyone can use and share. It has an open license, is openly accessible, and is both human-readable and machine-readable.
You’re probably already using open data every day – for example:
• Geospatial information (in getting from point A to point B)
• Weather data (in deciding how to dress for the day)
Real-time data is data that can be created, stored, processed, analyzed, and visualized almost instantly, i.e. within milliseconds.
Reference data is data used to describe an object along with its properties. The object described by reference data may be virtual or physical.
SaaS stands for Software-as-a-Service. It allows vendors to host an application and make it available over the internet. SaaS services are provided in the cloud by SaaS providers.
Semi-structured data is data that does not follow a traditional, fixed structure. It is neither fully structured nor unstructured, but contains some tags, data tables, and structural elements. A few examples of semi-structured data are XML documents, emails, tables, and graphs.
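An XML document shows what "some tags and structural elements" means in practice: every order below is tagged, but not every order has the same fields. The sketch uses Python's built-in `xml.etree.ElementTree` module on a made-up document.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<orders>
  <order id="1"><customer>Anna</customer><note>Deliver before noon</note></order>
  <order id="2"><customer>Lars</customer></order>
</orders>
""".strip()

root = ET.fromstring(DOCUMENT)

orders = []
for order in root.findall("order"):
    orders.append({
        "id": order.get("id"),
        "customer": order.findtext("customer"),
        # findtext returns None when the tag is absent from this order.
        "note": order.findtext("note"),
    })

for row in orders:
    print(row)
```

The tags make the data machine-readable, while the optional `note` field shows why such data does not fit neatly into a fixed table.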
In the most general sense, structured data is information that is organized according to a predefined format, such as the rows and columns of a database table.
Text analytics is the application of linguistics, machine learning, and statistical techniques to text-based sources in order to derive insight or meaning from the text data.
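One of the most basic statistical techniques applied to text is counting which words occur most often. The sketch below does this for a made-up customer review; the stop-word list and the `top_terms` function are simplified assumptions, and real text analytics typically involves much richer language processing.

```python
import re
from collections import Counter

# Common words that carry little meaning on their own (tiny illustrative list).
STOP_WORDS = {"the", "was", "and", "but", "a"}


def top_terms(text, n=2):
    """Return the n most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)


review = "The delivery was late. Late delivery, but the bread was fresh."
top = top_terms(review)
print(top)
```

For this review, "delivery" and "late" each appear twice, which already hints at what the customer was unhappy about.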
Unstructured data is data for which no structure can be defined, which makes it difficult to process and manage. Common examples of unstructured data are the text of email messages and data sources containing text, images, and videos.
This term defines the value of the available data. The collected and stored data may be valuable for societies, customers, and organizations.
The total available amount of the data. The data may range from megabytes to brontobytes.
Weather data consists of the measurements, trends, and patterns that help to track the state of the atmosphere. Real-time weather data can be used in several different contexts, for example by a logistics company that uses it to optimize goods transportation.