Tag Archives: R

Multivariate statistics for hydrogeology: moving forward from “the present is the key to the past”

Introduction

We wrote this paper as a reflection on a research journey of roughly 15 years: starting from geological research, moving toward hydrogeology, and finally applying several statistical methods to identify the behaviour of groundwater within aquifers. The abstract has been submitted to ICAS 2016.

It began with Piper's graphical plotting method in 1944, followed by the Schoeller and Stiff diagrams; all three visualise water quality using a multivariate statistical approach. More recently, the term machine learning has become prominent in many fields. Having started in biology and medicine, it is now beginning to be used in hydrogeology. One application is to recognise the character of groundwater quality and separate samples into several groups of similar type.

This development is what we try to describe in this paper; we hope it is useful. We are currently still preparing the full paper.


Attention to participants: Intro to R and Reference Management

WTF4

(image from: writingclassesforkids.com)

For participants of the Intro to R and Reference Management course.

Be sure to download and install all the required software before coming to class.

1) R base from cran.r-project.org
2) RStudio from rstudio.com
3) Zotero app, Zotero add-ins for Firefox, and Zotero add-ins for MS Word/LibreOffice from zotero.org

See you later in class.


matrix aggregation: Upcoming post on R from dummies

Dear friends,

After a long hiatus, here’s my new post (in Bahasa Indonesia) about matrix aggregation. This article is a collaboration between me and Ali Akbar Hakim (@osairisali), a fellow R user from the Faculty of Economics, Brawijaya University, Indonesia. He majors in economics and is currently finishing his undergraduate thesis on input-output economic analysis.

The article is currently under final revision, but here are a few snapshots of it.

Screen Shot 2014-12-07 at 6.16.54 AM

Screen Shot 2014-12-07 at 7.04.32 AM

Screen Shot 2014-12-07 at 7.05.14 AM

Several more things about data analysis

Outline

Part 1: Data in business

Author: Dasapta Erwin Irawan

1.1 Introduction

In this online era, we are surrounded by data. Data used to be associated only with laboratory work, school projects, and the like. Now it is all around us. Data was once only something we measured; now it is something we trade as goods. People are interested in anything that can be converted into an observation or a measurement. Some say data is the by-product of digital existence. People tend to analyse anything; they are even interested in the rise and fall of “Jennifer” as a girl’s name over time. So to speak, data is now everyday talk in coffee shops, or perhaps in the wet market, when people talk about the rise and fall of cabbage prices.

1.2 Data in business: why is it so important

1.2.1 Forecasting: we need to know the future

What’s all the excitement about data analysis? Forecasting is one reason. People always need to know what will happen in the future, given existing conditions as a baseline, some assumptions, and the chance of disruption along the way. Forecasting is one of the main parts of business. It is so important that a business proposal is likely to be thrown away if it does not contain a data-based forecast.

A time series is a collection of measurements of a well-defined item obtained through repeated measurement over time. For example, the value of retail sales measured each month of the year is a time series, because sales revenue is well defined and consistently measured at equally spaced intervals. Data collected irregularly or only once are not time series. A time series can be decomposed into three components: a trend component (long-term direction), a seasonal component (systematic, calendar-related movements) and an irregular component (unsystematic, short-term fluctuations). The decomposition is important in data analysis, because what we see in the chart doesn’t necessarily reflect what happens in real life. For instance, consider a chart of turkey sales. Sales are likely to be high around Thanksgiving, so our first guess would be that this is a cyclic phenomenon. But what if the peak falls at a different time in a certain year? We wouldn’t expect Thanksgiving itself to have shifted. What if there are more than two peaks in turkey sales in a year? And so on.
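The three-component decomposition described above can be sketched in a few lines. This is a minimal illustration that assumes an additive model (series = trend + seasonal + irregular) and uses a centred moving average for the trend; the function names and the monthly sales figures are made up for the example and are not taken from any real data set.

```python
import random

def moving_average_trend(series, period):
    """Centred moving average; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: the two end points each get half weight
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend

def decompose_additive(series, period):
    """Split a series into trend, seasonal and irregular parts."""
    trend = moving_average_trend(series, period)
    # Average the detrended values for each position in the cycle
    buckets = [[] for _ in range(period)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            buckets[t % period].append(y - tr)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    pattern = [m - centre for m in means]  # seasonal effects sum to zero
    seasonal = [pattern[t % period] for t in range(len(series))]
    irregular = [None if tr is None else y - tr - s
                 for y, tr, s in zip(series, trend, seasonal)]
    return trend, seasonal, irregular

# Synthetic monthly sales (4 years): slow growth, a spike every
# November (month index 10), and random noise.
random.seed(0)
sales = [100 + 0.5 * t + (30 if t % 12 == 10 else 0) + random.gauss(0, 2)
         for t in range(48)]

trend, seasonal, irregular = decompose_additive(sales, 12)
peak_month = max(range(12), key=lambda m: seasonal[m])
print("month index with the largest seasonal effect:", peak_month)
```

Once the seasonal part is separated, an unexpected second peak or a shifted peak would show up in the irregular component rather than being mistaken for the usual yearly pattern.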

Let’s think of data as something that has its own behaviour. Its behaviour may be naturally rhythmic or erratic. It may be stiff and dull, insensitive to external influence, or flexible and very sensitive to outside parameters. Or it may simply behave erratically, with no major controlling parameter at all. As you can see in the following picture, the forecast is part of a loop. It analyses performance and transforms it into inputs for decision making. This loop drives the evolution of a business model or organisation.

forecast

Fig 1 A business cycle involving forecast in the loop (from: University of Baltimore web page)

1.2.2 Decision making: reaction to every action

There is a reaction to every action

I don’t mean that in the physics sense, and probably most of you don’t either, but like any organisation, a business is always changing. Businesses evolve through the test of time. The Apple we see now is not the same as it was back in the 1970s. Or, in contrast to that proprietary vendor, consider open source software. Take Linux, an open source operating system. First built as a personal project by a Finnish computer science student, Linus Torvalds, Linux is now a multi-billion dollar business. A technology that started out free for all has been transformed into a profit-oriented product. The surprising news is that the free-for-all Linux is still making its way alongside its commercial side. This product can change the way people see free stuff. The operating system itself is still free, but the service of building and maintaining Linux systems is highly commercial. See, that’s evolution, and it involves data.

In the next article, we are going to talk about how we can see correlation between parameters in a data set and how we build a model. Then, in the last article, we will discuss a new profession called data scientist.

Part 2: Looking for correlation

We have discussed how data can change the form of a business or an organisation. All the changes that might happen are based on data forecasts. Now we’re going to talk about the second reason why data is important in business. Instead of only looking at a time series chart, people also need to know what correlations can be drawn from it. Let’s use an example.

A simple correlation case was presented by Bryant and Smith in Practical Data Analysis: Case Studies in Business, in 1995. They showed a data set containing measurements taken by a single waiter on dining parties in a restaurant. The variables include total bill ($), tip ($), gender of the bill payer, day of the week, and the tip as a percentage of the total bill. They wanted to see which variable or variables have the strongest influence on total tips in a week. They also compared the tips from male and female customers.

We can see in the chart, that total bill size and tip are positively associated (upper left scatter plot), but not as strongly as one might expect because there is increasing variability in tip as bill increases. Both tip and total bill have skewed distributions (upper left histograms), which might lead the analyst to consider log-transforming these variables.

Males spend more on average than females and bills are higher on the weekend (shown in the side-by-side box-plots). The 70% tip on a very small bill by a male on a Sunday may be an outlier. Much can be learned about tipping behaviour by studying this chart.

waiter

Fig 2 An example of correlation chart of multiple variables (from: National Library of Australia)
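To make "positively associated, but with increasing variability" concrete, here is a small sketch that computes Pearson's correlation coefficient on the raw scale and after a log transform. The bills and tips below are synthetic numbers generated for illustration, not the restaurant data set discussed above, and the 15% tip rate is an assumption of the example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
# Synthetic data: bills are right-skewed, and the tip is roughly 15% of
# the bill with multiplicative noise, so the spread of tips grows with
# the size of the bill.
bills = [random.lognormvariate(3.0, 0.5) for _ in range(200)]
tips = [0.15 * b * random.lognormvariate(0.0, 0.3) for b in bills]

r_raw = pearson(bills, tips)
r_log = pearson([math.log(b) for b in bills], [math.log(t) for t in tips])
print(f"raw scale r = {r_raw:.2f}, log scale r = {r_log:.2f}")
```

On skewed data like this, a few very large bills can dominate the raw-scale coefficient, which is one reason an analyst might look at the log-transformed variables as well.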

But we must take into account that correlation doesn’t always mean causation. In the case above, there is indeed a correlation between the customer’s gender and tipping behaviour. But what drives that behaviour has not been discussed yet. It generally lies beneath the numbers, and we have to dig it out.

Many studies are actually designed to test a correlation, but not a causation. In general, it is extremely difficult to establish causality between two correlated observations, but on the other hand, there are many statistical tools to establish a statistically significant correlation.

You would be surprised how often common-sense conclusions about cause and effect turn out to be wrong. A correlation can arise simply because two things frequently occur together. A correlation may also be observed when there is real causality behind it; for example, it is well known that cigarette smoking not only correlates with lung cancer, but actually causes it. The hardest part is that, in order to establish cause, we have to rule out any other possible explanation for the observed correlation, for instance that smokers are more likely to live in urban areas, where there is more pollution.

So we can say that the search for causality can start from a series of correlations, but we have to add controlled variables to the analysis, as shown in the smoking example. We have to set the assumptions and narrow down the potential governing variables. We call the result a model.
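The role of a controlled variable can be illustrated with a simulation. In this sketch a hidden third factor `z` (hypothetical, standing in for something like place of residence) pushes both `x` and `y` up, while `y` never depends on `x` directly. All numbers are synthetic and chosen only to make the effect visible.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
rows = []
for _ in range(1000):
    z = random.random() < 0.5                     # hidden third factor
    x = (5.0 if z else 0.0) + random.gauss(0, 1)
    y = (5.0 if z else 0.0) + random.gauss(0, 1)  # y never depends on x
    rows.append((z, x, y))

# Correlation when the third factor is ignored
r_all = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each level of the third factor (z held fixed)
r_within = {}
for level in (False, True):
    group = [r for r in rows if r[0] == level]
    r_within[level] = pearson([r[1] for r in group], [r[2] for r in group])

print(f"ignoring z: r = {r_all:.2f}")
for level, r in r_within.items():
    print(f"z = {level}: r = {r:.2f}")
```

The overall correlation is strong, yet it largely disappears once `z` is held fixed. That is the pattern one checks for before reading a correlation as cause and effect.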

In the 3rd and 4th parts of this “Data Talks” article, we are going to talk about “Model” and “Data Scientist”.

Part 3: How to build a model

The model is the most basic element of the scientific method, and business is just as close to science as physics is. Probably without noticing, we’ve already talked about “models” in the previous “Forecast” and “Decision Making” parts. Both rely on mathematical models in the form of equations. You probably know linear regression (see the following figure), or remember learning it in algebra. It is just one model among many that can do the actual forecasting for us.

linear

Fig 3 An example of linear regression model (from: National Library of Australia)
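For reference, a straight-line model like the one in the figure is commonly fitted by ordinary least squares. The notation below (observations $x_i$, $y_i$, intercept $a$, slope $b$) is our own and is only one standard way to write it; the figure's source may have used a different fitting method.

```latex
\hat{y} = a + b\,x, \qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
a = \bar{y} - b\,\bar{x}
```

As a quick check with the three points (1, 2), (2, 4), (3, 6): the means are $\bar{x} = 2$ and $\bar{y} = 4$, so $b = (2 + 0 + 2)/(1 + 0 + 1) = 2$ and $a = 4 - 2 \cdot 2 = 0$, which gives back the line $\hat{y} = 2x$.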

We also talked about a model when we looked at the business loop diagram in the previous article, and we meet models when we buy our children Hot Wheels or Barbie. Even a recipe is a model. So we could say a model is a simplification of what we are actually studying or trying to predict.

This is how we build a model:

  1. Data gathering. We talked about it in the forecast article. The data can be a long time series, such as a rainfall data set, or the results of a questionnaire.
  2. Setting the assumptions. Most models only work in a controlled environment, so we have to set the boundaries. The more boundaries, the narrower our model will be. How many boundaries should we have? Answering this can be an iterative process with steps 3 and 4.
  3. Model fitting. This is the fun part. We can use major proprietary software like Stata, SPSS, and SAS, or choose a free one like R. These packages contain many model equations that we can pick and test later on.
  4. Model calibration. This part is also done automatically by software. Basically, we apply our chosen equation to new data. If the result behaves the same way as it did with the data we used in step 3, then we can say our model is actually working. If not, we have to go back to step 3 or even step 2.
  5. Model application. This is the phase that we like the most. But over time we have to re-evaluate our model against the current situation.
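The five steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, assuming a straight-line model fitted by least squares; the function names, the 80/40 split and the tolerance used in step 4 are arbitrary choices for the example, not a recommended standard.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmse(xs, ys, a, b):
    """Root mean squared error of the line on the given data."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
                     / len(xs))

# Step 1: data gathering (synthetic stand-in for real measurements)
random.seed(3)
xs = [random.uniform(0, 10) for _ in range(120)]
ys = [2.0 + 0.8 * x + random.gauss(0, 0.5) for x in xs]

# Step 2: assumptions: the relationship is linear and the model is
# only meant to be used for 0 <= x <= 10.

# Step 3: model fitting on the first 80 observations
a, b = fit_line(xs[:80], ys[:80])

# Step 4: apply the fitted equation to 40 observations it has not seen
err_fit = rmse(xs[:80], ys[:80], a, b)
err_new = rmse(xs[80:], ys[80:], a, b)
model_ok = err_new < 2 * err_fit  # arbitrary illustrative tolerance

# Step 5: model application, inside the assumed range only
forecast = a + b * 5.0
print(f"a = {a:.2f}, b = {b:.2f}, error on new data = {err_new:.2f}")
```

If the error on the new data were much larger than on the fitted data, that would be the signal to go back to step 3 or step 2.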

Another thing we have to bear in mind is the Law of Simplicity. The simplest model has a higher chance of being accepted in a business environment. Top executives would probably care less about a model with 11 variables. A model with two or three variables is frequently the choice of a data scientist, previously known as a data analyst. In the 4th part of this “Data Talks” article, we are going to talk about the “Data Scientist”, a new blossoming career for mathematicians, statisticians and computer scientists.

Part 4: Data scientist

It was not until about five years ago that people invented a new kind of profession called “data scientist”. A data scientist represents an evolution from the business or data analyst role. The role typically requires a solid grounding in computer science and applications, modelling, statistics, analytics and math. We are talking about a powerful career: one that can predict the future, talk about it, and persuade others. Good data scientists will not just address business problems; they will pick the right problems, the ones that have the most value to the organisation.

datasci

Fig 4 A profile of a data scientist (from: EMC² web site)

The work of a data scientist would more or less cover the following aspects (extracted from a Coursera forum):

  • Formulate context-relevant questions and hypotheses to drive data scientific research
  • Identify, obtain, and transform a data set to make it suitable for the production of statistical evidence communicated in written form
  • Build models based on new data types, experimental design, and statistical inference

Aside from proficiency in computer science, math and statistics, a good data scientist must have curiosity, creativity, focus and attention to detail.

Data scientists are needed wherever data is involved in an operation. Companies that hire data scientists include:

  • Construction companies
  • Utility companies
  • Oil, gas and mining companies
  • Hospitals and health care organisations
  • Colleges and universities
  • Federal, provincial/state and municipal government departments
  • Transportation companies
  • Telecommunications companies
  • Insurance, finance and banking organisations
  • Management consulting companies
  • Manufacturing companies

As we conclude our talk on data, it’s clear that

Numbers are not just numbers

They can speak

And it’s up to us to listen

 

 

Blogpost arrangement

UWS lake

[image from: personal collection, a path way in UWS MacArthur Campus]

Dear friends,

It’s another 10-degree morning in Sydney. I know that is fairly warm for most of you, but for an Indonesian, it takes an effort to get the fingers to hit the right keys.

After thinking about the position of my blogs, I will:

  1. gradually focus this blog, my oldest blog, on Linux and other tech and internet topics
  2. focus my posts on R and data analysis on my Blogger site
  3. gradually move my posts about hydrology and research/teaching to my other new WordPress blog, My Online Water Books. I will try to include a PDF version with every post.

I’ll post all of my updates on my Twitter (@dasaptaerwin) and Google Plus (+Dasapta Erwin Irawan).

Thank you. All the best to you all.