The role of Data Scientist is new and important. Big Data is seen as the key area for innovation, and the Data Scientist is a key role in putting Big Data to work.
So who are the Data Scientists, what do they do, and does the role involve skills that we as developers may want to acquire?
The Data Scientist mines data for useful insights. It’s a role closely related to that of the Computer Scientist, a role often filled by developers like us.
Last month we looked at the tools used by the Computer Scientist, this month we look at the skills and tools needed by the Data Scientist.
The Data Scientist Role
In his book “Data Visualisation – a Successful Design Process” Andy Kirk identifies the “Eight Hats” of data visualisation design:
- The Initiator – The leader who seeks a solution.
- The Data Scientist – The data miner, wearing a miner’s hat, discovering nuggets of insight buried deep within the numbers.
- The Journalist – the story teller who refines the insight with narrative and context.
- The Computer Scientist – The person who breathes life into the project with their breadth of software and programming literacy.
- The Designer – With an eye for visual detail and a flair for innovation, they work with the computer scientist to ensure harmony between form and function.
- The Cognitive Scientist – Brings an understanding of visual perception, colour theories and human-computer interaction to inform the design process.
- The Communicator – The negotiator and presenter who acts as the client-customer-designer gateway.
- The Project Manager – The co-ordinator who picks up the unpopular duties and makes sure that the project is cohesive, on time and on message.
These are hats, and we will probably find ourselves wearing several of them over time. As you can see, Data Visualisation requires us to pull together a range of disciplines in order to achieve something meaningful.
Last month we focused on the skills of the Computer Scientist, looking at the skills needed to pull the data out of the repository and put it in front of the audience.
This month we are looking at the skills of the Data Scientist. Here’s Kirk’s full description:
The data scientist is characterized as the data miner, wearing the miner’s hat. They are responsible for sourcing, acquiring, handling, and preparing the data. This means demonstrating the technical skills to work with data sets large and small and of many different types. Once acquired, the data scientist is responsible for examining and preparing the data. In this proposed skill set model, it is the data scientist who will hold the key statistical and mathematical knowledge and they will apply this to undertake exploratory visual analysis to learn about the patterns, relationships, and descriptive properties of the data.
Last month we talked about data being the new soil. The data scientist is a miner who digs down deep. It is a pivotal role in the design process. Kirk elaborates further:
If we don’t have the data we want, or the data we do have doesn’t tell us what we hoped it would, or the findings we unearth aren’t as interesting as we wish them to be, there is nothing we can (legitimately) do about it. That is an important factor to remember. No amount of 3D-snazzy-cool-fancy-design dust sprinkled on to a project can change that.
An incomplete, error-strewn or just plain dull dataset will simply contaminate your visualization with the same properties. So, the primary duty for us now is to avoid this happening, remove all guessing and hoping, and just get on with the task of acquiring our data and immerse ourselves into it to learn about its condition, its characteristics, and the potential stories it contains.
This month we’re going to look at some of the tools we can use as Data Scientists to immerse ourselves in the data. Tools that will help us to interact with our data, drill down into its seams and discover what nuggets lie within.
If you want to have a chance of winning one of this month’s books then please sign up on the Meetup page. At the end of May the lucky winner will get to choose a physical copy and the runner-up can select an ebook.
For a second month we are going to look at Andy Kirk’s “Data Visualisation – a Successful Design Process.” It’s a great introduction to using Data Visualisation in your applications and the key text behind this series of competitions.
Kirk provides us with a structured approach to what can appear like a dark art. The task of data familiarisation, for example, is organised into the following steps:
- Acquisition – Getting hold of the data.
- Examination – Assessing the data’s completeness and fitness.
- Data Types – Understanding the properties of the raw material. (Not to be confused with Data Types in our code.)
- Transforming for Quality – Tidying and cleaning, filling in the gaps.
- Transforming for Analysis – Preparing and refining for final use.
- Consolidating – Bringing it all together, mashing it up with other sources.
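The familiarisation steps above can be sketched in code. Here is a minimal illustration in Python, using an invented two-column dataset (the field names and values are hypothetical, not from Kirk’s book):

```python
# A sketch of Kirk's familiarisation steps on a tiny, hypothetical CSV
# dataset. The fields and figures are invented for illustration only.
import csv
import io

raw = io.StringIO(
    "year,population\n"
    "2010,1200\n"
    "2011,\n"        # a gap we will fill during transformation
    "2012,1350\n"
)

# Acquisition: get hold of the data.
rows = list(csv.DictReader(raw))

# Examination: assess completeness and fitness.
missing = [r for r in rows if not r["population"]]
print(f"{len(missing)} of {len(rows)} rows incomplete")

# Transforming for quality: fill the gap by interpolating neighbours.
for i, r in enumerate(rows):
    if not r["population"]:
        prev = int(rows[i - 1]["population"])
        nxt = int(rows[i + 1]["population"])
        r["population"] = str((prev + nxt) // 2)

# Transforming for analysis: cast to the types the analysis needs.
data = [(int(r["year"]), int(r["population"])) for r in rows]
print(data)  # [(2010, 1200), (2011, 1275), (2012, 1350)]
```

Real datasets demand far more care at each step, but the shape of the work is the same: acquire, examine, repair, then refine into a form the analysis can consume.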
Let’s start at the end: the consolidation of data to create an effective visualisation, like the one above.
Highcharts provides interactivity, allowing users to become data scientists and explore the data for themselves.
This is an easy approach for simple data mining needs, but the real value is in mining the rich seams of complex data through Data Analysis. For this some powerful tools are needed.
Anybody looking for a ‘real world’ use of Clojure should take a look at the Incanter libraries and the practical value they provide in the first phases of Data Science. Eric Rochester’s Clojure Data Analysis Cookbook is full of solid, practical recipes for dealing with large datasets. It shows you how to go beyond spreadsheets to deal with data on new scales of size and complexity.
The book is particularly strong on recipes for acquisition and transformation for quality and analysis. The first chapter will show you how to pull in your data from a whole range of data sources, including JSON, XML, CSV, JDBC and Excel. The second chapter will show you how to clean up your data with tools like regular expressions, synonym maps, custom data type parsers and the Valip validation library.
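The cleaning techniques in that second chapter are straightforward to picture. Here is a hypothetical sketch, in Python rather than the book’s Clojure, of two of them: regular expressions to tidy values, and a synonym map to collapse the variants that remain (the place names and mappings are invented for illustration):

```python
# Hypothetical illustration of two cleaning techniques: regular
# expressions and a synonym map. The values are invented, not taken
# from the cookbook, which works in Clojure.
import re

records = ["  London ", "london", "LONDON, UK", "N.Y.", "new york"]

def tidy(value):
    """Normalise case, drop suffixes and punctuation."""
    value = value.strip().lower()
    value = re.sub(r",.*$", "", value)  # drop ", uk" style suffixes
    value = re.sub(r"\.", "", value)    # "n.y." -> "ny"
    return value.strip()

# The synonym map collapses surviving variants onto one canonical form.
synonyms = {"ny": "new york"}

cleaned = [synonyms.get(tidy(r), tidy(r)) for r in records]
print(cleaned)  # ['london', 'london', 'london', 'new york', 'new york']
```

Five messy variants collapse to two canonical values, ready for analysis.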
Eric Rochester’s Cookbook provides sound, practical recipes. If you want a practical introduction to Data Analysis that will get you up, running and productive quickly then this is the place to start.
It also touches on a whole range of other related topics, such as parallel programming, distributed processing and machine learning.
It isn’t so strong on the theoretical side of data analysis. There’s a whistlestop tour of linear and non-linear relationships, Bayesian distributions and Benford’s law in chapter 7. Chapter 9 introduces Weka for machine learning. In between, chapter 8 shows you how to interface with Mathematica or R.
If you want to learn more about the theory of data analysis you may want to consider working with R directly. R is the lingua franca of statistics and learning it will give you access to a wealth of resources available on the web.
In this Beginner’s Guide the author John M. Quick will show you how to get up and running with R. The material is more abstract, with talk of standard deviations, linear models and ANOVA. However, the author makes it more entertaining with a bit of role play. You are the lead strategist for a kingdom, who must gather your intelligence, prepare the battle plans and brief the emperor and his generals.
When it comes to learning the mathematical theory the book doesn’t go much deeper than the Data Analysis Cookbook. However, it does present the information in an entertaining way and by learning R you open the door to working directly with a tool used by mathematicians rather than programmers.
Statistical analysis has been around for a long time, but it is now being performed with more data than ever before. Companies like Google and Facebook are now working with data on an unprecedented scale and that is why there is so much buzz about Big Data.
If you want to work with Big Data, processing massive data sets measured in terabytes, then the essential tool to learn is MapReduce.
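The MapReduce model itself is simple enough to sketch in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group to a result. Hadoop’s contribution is distributing exactly this pattern across a cluster; the word-count example below is the classic illustration, run here in a single process:

```python
# The MapReduce pattern in miniature: word count in a single process.
# Hadoop distributes these same three phases across many machines.
from collections import defaultdict

documents = ["big data big insight", "data mining"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: fold each group down to a single count.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'mining': 1}
```

The programmer supplies only the map and reduce functions; the framework handles the shuffle, the distribution and the failures, which is what makes the model scale to terabytes.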
The authors are well qualified. Srinath Perera is a Senior Software Architect at WSO2 and has a Ph.D. Thilina Gunarathne is a Ph.D. candidate at the School of Informatics and Computing of Indiana University. They have provided 90 recipes, presented in a simple and straightforward manner, with step-by-step instructions and real-world examples. These recipes guide you through the complex business of getting Hadoop up and running, and then not only demonstrate what MapReduce is but how it can be applied to problems such as analytics, indexing, searching, classification and text processing on a massive scale. Along the way you will be exposed to the tools and techniques that are fundamental to working with big data.
While Hadoop is implemented in Java and offers a Java API, it doesn’t really sit within the Java ecosystem. Using Hadoop means learning a whole new ecosystem: to use it properly you’ll need to get to know complementary Apache projects such as HBase, Hive, and Pig.
If you want to experiment with MapReduce and distributed computing while staying firmly within the Java ecosystem, then consider the Infinispan data grid platform. Installing Infinispan is as easy as installing JBoss AS7, and you can use it to provide persistence for your standard CDI applications without alteration. The authors are Java people. Francesco Marchioni has written several books on the JBoss application server and Manik Surtani is the specification lead of JSR 347 (Data Grids for the Java Platform).
The book offers practical guidance to get you up and running with the Infinispan platform. While the recipes don’t deal with MapReduce directly, they will leave you well equipped to follow the online documentation.