
WTF: WHAT IS CSL?


This manuscript is an early draft, part of the book “Menulis (ilmiah) itu menyenangkan” (Scientific writing is fun). This short article was inspired by a discussion we joined a few days ago about citation style language in thesis formatting. As usual, this document was written in text mode using Emacs org-mode, without MS Word. I hope it is useful.

1 Citation styles

You have surely seen a reference list (Daftar Pustaka). It holds the full details of every source you cited in the text. It might look like this:

Irawan, DE., Silaen, H., Sumintadireja, P., Lubis, RF., 
Brahmantyo, B., and Puradimaja, DJ. (2014). 
Groundwater-surface water interactions of Ciliwung River streams, 
segment Bogor-Jakarta, Indonesia, Environmental Earth Sciences, 
73(7). doi:10.1016/j.obhdp.2007.08.002

If you compare two different journals, their reference styles often differ. That is because at least four citation styles are in common use worldwide:

  • APA
  • MLA
  • Harvard
  • Vancouver

Each has its own format for writing the reference list as well as for citing references in the text. If you only use five or 10 references, typing them by hand may not be a problem. But once you have 50 references or more, you will need an application that supports citation management.

As I have often explained, a citation management setup consists of these components:

  1. a citation manager application, e.g. Zotero, Mendeley, or EndNote,
  2. a connector to the browser: you can use Google Chrome, Mozilla Firefox, or Safari,
  3. a connector to the word processor: you can use LibreOffice, TeXstudio (LaTeX), or Microsoft Office,
  4. a citation style language (CSL) file: some of these files come pre-installed with the citation manager, but you will need to install additional CSL files to match the requirements of the journal you are writing for.

2 The CSL file

CSL stands for Citation Style Language. A CSL file is a plain text (XML) file that you can open with an ordinary editor such as Notepad. As the name suggests, the file stores a citation style. The file name usually matches the name of the citation style, e.g. hydrogeology-journal.csl.

Its content looks roughly like this:

<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
  <!--
   Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/springer
  -->
  <info>
    <title>Hydrogeology Journal</title>
    <title-short>Hydrogeol J</title-short>
    <id>http://www.zotero.org/styles/hydrogeology-journal</id>
    <link href="http://www.zotero.org/styles/hydrogeology-journal" rel="self"/>
    <link href="http://www.zotero.org/styles/springer-basic-author-date" rel="independent-parent"/>
    <link href="http://www.springer.com/cda/content/document/cda_downloaddocument/Key_Style_Points_1.0.pdf" rel="documentation"/>
    <link href="http://www.springer.com/cda/content/document/cda_downloaddocument/manuscript-guidelines-1.0.pdf" rel="documentation"/>
    <category citation-format="author-date"/>
    <category field="science"/>
    <issn>1431-2174</issn>
    <eissn>1435-0157</eissn>
    <updated>2014-05-18T01:40:32+00:00</updated>
    <rights license="http://creativecommons.org/licenses/by-sa/3.0/">
      This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License
    </rights>
  </info>
</style>

Makes your head spin, doesn't it?

So you need to know the name of the CSL style before you download it from a CSL repository. Here we use the Zotero citation manager, whose repository is at http://www.zotero.org/styles. Rather than learning to program CSL files, it is better to look up the name and then download the file.
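If you would rather not read the XML by eye, a few lines of code can pull out the parts that matter when choosing a style. The following is only a minimal sketch: it assumes a CSL file shaped like the sample above, uses a trimmed inline copy of that sample so it runs on its own, and the function name `summarize_csl` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <style> element of CSL files.
CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}

# Trimmed copy of the sample style, kept inline so the sketch is self-contained.
SAMPLE_CSL = """<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
  <info>
    <title>Hydrogeology Journal</title>
    <title-short>Hydrogeol J</title-short>
    <id>http://www.zotero.org/styles/hydrogeology-journal</id>
    <link href="http://www.zotero.org/styles/hydrogeology-journal" rel="self"/>
    <category citation-format="author-date"/>
    <category field="science"/>
    <issn>1431-2174</issn>
  </info>
</style>"""


def summarize_csl(xml_text):
    """Return the title, id and citation format declared in a CSL file."""
    root = ET.fromstring(xml_text)
    info = root.find("csl:info", CSL_NS)
    citation_format = None
    for category in info.findall("csl:category", CSL_NS):
        # Only one of the <category> elements carries the citation format.
        if "citation-format" in category.attrib:
            citation_format = category.attrib["citation-format"]
    return {
        "title": info.findtext("csl:title", namespaces=CSL_NS),
        "id": info.findtext("csl:id", namespaces=CSL_NS),
        "citation_format": citation_format,
    }


summary = summarize_csl(SAMPLE_CSL)
print(summary["title"], "->", summary["citation_format"])
```

For a downloaded file you would read its text first and pass it to the same function; the last part of the `id` is the name to search for in the repository.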

Figure 1: The Zotero CSL repository

3 Citation management workflow

Put simply, citation management can be described by the flow diagram below. Its main components are:

  • the citation management application,
  • the connector to the browser and the connector to the word processor.

The workflow starts with you looking for the references you need through a browser (you can use Chrome, Firefox, or Safari). Users generally visit scientific database sites such as:

  • Google Scholar
  • Scopus
  • ScienceDirect
  • ProQuest
  • etc.

Figure 2: Citation management workflow

When you find a paper you need, besides downloading the PDF file, make sure you also download the citation info (look for an “Export” or “Export citation” button) and pick the citation info format that suits your citation manager. If you choose “BibTeX” or “text file”, most citation manager applications will generally recognise it. Citation info is the metadata of each reference you download. Later, the citation info is imported into your citation manager, producing citation items that are ready to be cited in the text. You can also attach the PDF file to the corresponding citation item.
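To make "citation info is metadata" concrete, here is a minimal sketch of reading one exported BibTeX entry. It is not how any particular citation manager imports files: the parser only handles simple `name = {value}` fields without nested braces, the sample entry is rebuilt from the reference shown in section 1 (shortened to two authors), and the citation key `irawan2014` is made up.

```python
import re

# Hypothetical export, rebuilt from the reference shown in section 1.
# The key "irawan2014" is made up for illustration.
SAMPLE_BIBTEX = """@article{irawan2014,
  author = {Irawan, DE. and Silaen, H.},
  title = {Groundwater-surface water interactions of Ciliwung River streams},
  journal = {Environmental Earth Sciences},
  year = {2014}
}"""


def parse_bibtex_entry(text):
    """Parse one simple BibTeX entry into a dict of metadata fields."""
    header = re.match(r"\s*@(\w+)\s*\{\s*([^,\s]+)\s*,", text)
    if header is None:
        raise ValueError("not a BibTeX entry")
    fields = {"type": header.group(1).lower(), "key": header.group(2)}
    # Limitation: values containing nested braces are not handled.
    for name, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", text):
        fields[name.lower()] = value.strip()
    return fields


entry = parse_bibtex_entry(SAMPLE_BIBTEX)
authors = entry["author"].split(" and ")  # BibTeX separates authors with "and"
print(entry["type"], entry["year"], authors)
```

Each key in the resulting dict corresponds to one piece of metadata that a citation manager stores for the item.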

Alternatively, if you use Zotero and have installed the browser connector, you can simply click the button showing the letter “Z” or a blue folder. A form then appears asking which citations you need, or just click “Select all”. The browser connector then automatically downloads the citation info and the PDF file (when available) and adds them to the library in the Zotero application. It is best to open Zotero first and select the target library or collection, or create a new one, before doing the step above.

You can see the result right away in your Zotero collection. All the citation info and PDF files are stored there. Quick and easy, isn't it?

4 Example

Suppose you face a case like this:

A thesis/dissertation template prescribes a standard citation format. The question: what is the name of that citation style? APA, MLA, Harvard, for example? Or does it follow particular journals?

When you face this, you should at least know the field, e.g. whether it is earth sciences or natural sciences. Once you know, you can visit the CSL repository of the citation manager you use. If you use Zotero, go to http://www.zotero.org/styles. Enter the field as a keyword; when you hover over a link in the search results, a preview of that citation style appears. Then simply compare it with the example given in the template.

However, if you have done the above and found no similar citation style, you may have picked the wrong field. Try other keywords. If you are still stuck, you will have to learn how to make your own CSL file. Pick the closest citation style and edit it. I cannot offer a tutorial yet because I am still learning myself.

In fact, because a CSL file is a text file with a standard format, you are free to follow any of the many tutorials available online. But if you use Zotero, I suggest you visit this link: https://www.zotero.org/support/dev/citation_styles/style_editing_step-by-step.

WTF: How is Indonesia “found”? SEO for Academics

Dasapta Erwin Irawan,

Institut Teknologi Bandung

Figure 1 Loupe, from flickr/alainbachellier

This short piece continues my earlier post titled Mengangkat nama Indonesia dari tulisan (Raising Indonesia's name through writing). If you want the PDF version, you can get it [here](http://goo.gl/9PJpWD). This time I will talk about how Indonesia is found. I am not a historian, so please do not take those words literally.

1 Introduction

This part is not really required. But I like introductions, so I have to write one.

1.1 Search Engines

What is a search engine?

A search engine is Google. That is the easy answer. More precisely, it is an application that searches for and analyses whatever the user types into the search box.

Is there anything besides Google?

Yes:

  • Microsoft has Bing
  • Those born in the 70s-80s will surely know Yahoo, Altavista (now gone), and Lycos. Apparently some of them still exist.

1.2 Scientific databases

What is this now?

If you want to search specifically for scientific material, you can use scientific databases. Two of them, both used later in this article, are Google Scholar and Scopus.

1.3 How do these applications work?

On the technical side I cannot answer, because I am not an IT or computer science graduate. What is clear is that these applications look for the keywords the user has entered.

They open their database and find the documents that contain those keywords.

1.4 Which parts do they search?

Good question. This is where I start discussing “how Indonesia is found”.

Let’s do some role playing.

2 Finding Indonesia?

Ready?…


Say you are a PhD student at a university abroad. You want to study groundwater in Bandung, so you start working on a literature review. What is that? Read here, here, and here.

To do so, you open several scientific databases, say Google Scholar and Scopus. If you want to know more about Scopus, you can read and download my slides on SlideShare.

You start typing a few keywords:

  • “groundwater Indonesia”
  • “groundwater Bandung Indonesia”
  • “groundwater Bandung”
  • etc.

What do the three keywords above have in common? The location, right? You would probably look for information at the scale of Indonesia first, then narrow down to the city of Bandung.

What do you expect to come up? Scientific papers whose titles contain the words Bandung and/or Indonesia, right?

So mentioning the location in the title matters a great deal. Whatever you write, make sure you have mentioned the location in the title.

Where else will a search engine find the word Indonesia?

Probably in the abstract. So when you write, make sure the location appears in your abstract.

Where else?

In the keywords section, usually placed below the abstract.

Anything else?

Yes, in the author affiliation. If you look at a scientific paper, the author affiliation is usually written after the author names or sometimes at the bottom left of the page (see the following figure).

Figure 2 Anatomy of a paper. Taken from my ResearchGate account

3 What is the current situation?

3.1 Number of scientific publications

What is the current situation? How many papers are written by Indonesians or by people with an Indonesian affiliation?

Let me just share a compilation from the Scopus database by Prof. Hendra Gunawan (Professor of Mathematics at ITB) in the following tweet.

Figure 3 Ranking of universities by number of papers produced (according to the Scopus database)

Please do not look at individual institutions; look at Indonesia as a whole. We are still far behind our own neighbour, Malaysia, aren't we?

By the way, these results would very likely differ if we used the Google Scholar (GS) or Microsoft Academic (MA) databases.

Why?

Because Scopus mainly indexes peer-reviewed papers and seminar proceedings registered with Scopus, while GS and MA search beyond those two types of paper.

3.2 Number of Indonesian students abroad

How many are there? Let us look at the following information.

 This is a long-term, continuing LPDP programme that sends off 3,000 of the nation's best sons and daughters every year.

"In 2015 the number of graduates returning home is estimated to reach 900, and the target is that by 2030 LPDP will have produced 60,000 national leaders," said the Director of LPDP, Eko Prasetyo.

Quoted from the LPDP Facebook page

The numbers above are from the LPDP scholarship alone. There are many other scholarships, from

  • Dikti,
  • the Bureau of Foreign Cooperation at Dikbud (sorry, I am still using the ministry's old name because it keeps changing),
  • etc.

So what does the number of students abroad have to do with this?

I will explain. Be patient.

3.3 Number of students abroad vs number of publications?

Isn't it good that many young Indonesians study abroad?

What does that have to do with the low number of publications?

I do not doubt its impact on Indonesia. I will only highlight one element of our students' scientific activity abroad.

Which activity?

Writing scientific papers.

And what is the problem?

Recall the anatomy of a paper (Figure 2) and how search engines find Indonesia. One of the places is the author affiliation. Let us look at the following possibilities:

  • Case 1: The author is a student at a domestic university with a research site abroad.

This needs no discussion, because it very rarely happens.

  • Case 2: The author is a student at a domestic university with a research site in Indonesia.

The author should then write the word Indonesia in the title, abstract, and keywords of the paper.

The paper will then appear in searches using keywords like those above.

  • Case 3: The author is a student at a university abroad with a research site in Indonesia.

The author will write the word Indonesia in the title, abstract, and keywords of the paper.

The paper will then appear in searches using keywords like those above, but it will not add to the per-institution paper count in the list in Figure 3.

Why? I will explain.

  • Case 4: The author is a student at a university abroad with a research site abroad.

This happens very often.

The author will not write the word Indonesia anywhere in the paper.

The paper will therefore not appear in searches using keywords like those above.
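The four cases can be checked with a toy model of a search. This is only a sketch under strong assumptions: a real search engine or Scopus is far more elaborate, the records are invented placeholders ("city X", "University A", "country Y"), and the function names are hypothetical. It assumes the search reads the title, abstract, keywords, and affiliation, and that an institutional count such as the one in Figure 3 looks at the affiliation only.

```python
# Fields a search engine is assumed to read in this toy model.
SEARCHED_FIELDS = ("title", "abstract", "keywords", "affiliation")


def is_found(paper, term):
    """True if the term occurs in any of the searched fields."""
    text = " ".join(paper.get(field, "") for field in SEARCHED_FIELDS)
    return term.lower() in text.lower()


def counts_for_indonesia(paper):
    """True if the paper would be credited to an Indonesian affiliation."""
    return "indonesia" in paper.get("affiliation", "").lower()


# Invented placeholder records, one per case in the list above.
papers = {
    "case 1": {"title": "Groundwater of city X",
               "affiliation": "University A, Indonesia"},
    "case 2": {"title": "Groundwater of Bandung, Indonesia",
               "affiliation": "University A, Indonesia"},
    "case 3": {"title": "Groundwater of Bandung, Indonesia",
               "affiliation": "University B, country Y"},
    "case 4": {"title": "Groundwater of city X",
               "affiliation": "University B, country Y"},
    # Case 4 again, with a co-author from an Indonesian institution.
    "case 4 + co-author": {"title": "Groundwater of city X",
                           "affiliation": "University B, country Y; "
                                          "University A, Indonesia"},
}

results = {name: (is_found(p, "Indonesia"), counts_for_indonesia(p))
           for name, p in papers.items()}
for name, (found, counted) in results.items():
    print(f"{name}: found={found}, counted={counted}")
```

In this model case 3 is found but not counted, and case 4 is neither found nor counted unless an Indonesian affiliation appears in the author list.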

4 Involve authors affiliated with institutions in Indonesia

We will focus on cases 3 and 4. In case 3, the paper appears in searches for the keyword Indonesia but does not add to the list in Figure 3, because the affiliation written in the paper is that of the university abroad.

In case 4, the paper will only appear if an author lists an affiliation with an Indonesian institution.

How?

For now, the only way I can think of is to involve a co-author affiliated with an institution in Indonesia. Ideally this additional author works at an Indonesian government institution located in Indonesia. Invite your former supervisor or a colleague who is studying in Indonesia.

Nothing wrong with that, is there?

Will your supervisor abroad agree?

As long as you put it well, the professor has no reason to refuse, provided the author order is right.

Why? Read my posts about [Authorship](http://goo.gl/yQZ9Tr) here and here.

So place the additional co-author's name according to his or her role, which will most likely be at the very end. That is not a problem, is it?

This way search engines will find your paper, one way or another.

5 Closing

That is the end of this short story, a story about Finding Indonesia. Apologies for any remaining typos and shortcomings here and there. Apologies also that the jumbo-sized images have not been resized yet.

“Well, it is still draft 1”, as my final-year students would say.

Regards, Erwin, Institut Teknologi Bandung. Find me on Twitter (@dasaptaerwin).

This manuscript was written in R Markdown.

Visor your supervisor

Author: Dasapta Erwin Irawan
 
(image from: http://katdaley.blogspot.com)
 
One of the major roles in your thesis is played by your supervisor. Unfortunately, supervisors are ordinary people who happen to have the job and obligation of supervising your thesis work. By calling them people, I mean they will not always be at your side supporting you with a wide smile. Many times you will find them just standing there with neither instructions nor suggestions on how to do your thesis. So just bear with them, because they are one of your chances to get your degree.
 
Basically there are two types of supervisors, based on their time commitment:
  1. The busy ones: this is the kind you are most likely to meet.
  2. The not (yet) busy ones: they are mostly early-career researchers at your uni, and would probably be the assistant of a busy one. If one of your supervisors is this kind, you are lucky, but not for long: these early-career researchers will not stay unbusy, as they are also working to advance their careers.
If you then divide them by how they like to communicate with their students, you find the following two:
  1. The tech-savvy ones: they use technology in their work. This kind of supervisor uses email, Skype sessions, and social media to communicate their science as well as to reach out to their students. Opening PDFs, reading slides, and compiling LaTeX scripts are no obstacle for them when dealing with you; your research comes first. If you have this kind of supervisor, you are lucky, because they can most likely communicate with you wherever and whenever. A casual conversation is likely their style. There is a small drawback for you: their messages may arrive outside common working hours. Unexpected meeting places are another challenge, e.g. a parking lot, an airport, a train station, or even a dark alley. You must be alert 24/7.
  2. The non-tech-savvy ones: at first you will have a major headache dealing with this kind of supervisor. You will need formal, conventional meetings. Structured notes and materials are probably your main asset. You might need a copy of his or her weekly schedule to make an appointment. Prepare your explanation carefully and develop a note-taking technique, as this supervisor may be a fast talker with very limited time for you. A formal conversation is likely their style.

No matter which kind of supervisor you have, you must be a quick learner to adapt to his or her style. Remember to develop your verbal and written communication skills.

 
Good luck.

Several more things about data analysis


Part 1: Data in business

Author: Dasapta Erwin Irawan

1.1 Introduction

In this online era, we are surrounded by something called data. In the past, data was associated only with laboratory work, school projects, and the like. Now data is all around us. Data used to be only something we measured; now it is something we trade as goods. People are interested in anything that can be converted into observations or measurements. Some say data is the by-product of digital existence. People tend to analyse anything. They are even interested in the rise and fall of “Jennifer” as a girl’s name over time. Data is now everyday talk in coffee shops, or maybe in the wet market, when people talk about the rise and fall of cabbage prices.

1.2 Data in business: why is it so important

1.2.1 Forecasting: we need to know the future

What is all the excitement about data analysis? Forecasting is one thing. People always need to know what will happen in the future, given the existing conditions as a baseline, with some assumptions and chances of disruption along the way. Forecasting is one of the main parts of business, so important that a business proposal is likely to be thrown away if it contains no data-based forecast.

A time series is a collection of measurements of a well-defined item obtained through repeated measurement over time. For example, measuring the value of retail sales each month of the year gives a time series, because sales revenue is well defined and consistently measured at equally spaced intervals. Data collected irregularly or only once are not time series. A time series can be decomposed into three components: a trend component (long-term direction), a seasonal component (systematic, calendar-related movements), and an irregular component (unsystematic, short-term fluctuations). The decomposition is important in data analysis, because what we read from a chart does not necessarily match what happens in real life. For instance, take a chart of turkey sales. Sales are likely to be high around Thanksgiving, so our first guess would be that this is a cyclic phenomenon. But what if the peak shifts in a certain year? We would not expect Thanksgiving itself to have moved. What if there are more than two peaks of turkey sales in a year, and so on?
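The three components can be separated with very little code. The sketch below is one simple way to do it (a classical additive decomposition with a centered moving average), not necessarily what a statistics package does internally. The quarterly sales figures are made up for illustration, with a peak in the fourth quarter to mimic holiday sales.

```python
def centered_moving_average(values, period):
    """Trend estimate; uses a 2 x period average when the period is even."""
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 0:
            # End points get half weight so the weights sum to `period`.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose(values, period):
    """Split a series into trend, seasonal and irregular parts (additive)."""
    trend = centered_moving_average(values, period)
    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    raw = [sum(b) / len(b) for b in buckets]
    mean_raw = sum(raw) / period
    pattern = [s - mean_raw for s in raw]  # seasonal effects sum to zero
    seasonal = [pattern[i % period] for i in range(len(values))]
    irregular = [None if t is None else v - t - s
                 for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, irregular


# Made-up quarterly sales for three years, peaking in the fourth quarter.
sales = [10, 12, 11, 30, 12, 14, 13, 32, 14, 16, 15, 34]
trend, seasonal, irregular = decompose(sales, period=4)
peak_quarter = seasonal[:4].index(max(seasonal[:4])) + 1
print("trend:", trend)
print("seasonal pattern:", seasonal[:4])
print("peak quarter:", peak_quarter)
```

The trend is undefined (`None`) at both ends because the moving average needs neighbours on each side. In this tidy example the irregular part is zero; with real data it is where surprises such as a shifted or extra peak would show up.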

Let us think of data as something with its own behaviour. It may be rhythmic by nature, or erratic. It may be stiff and dull, insensitive to external influence, or flexible and very sensitive to outside parameters. Or it may simply behave erratically without any major controlling parameter. As you can see in the following picture, a forecast is part of a loop. It analyses and transforms performance into inputs for decision making. This loop drives the evolution of a business model or organisation.


Fig 1 A business cycle involving forecast in the loop (from: University of Baltimore web page)

1.2.2 Decision making: reaction to every action

There is a reaction to every action

I am not into physics, as probably most of you are not either, but like any organisation, a business is always changing. Businesses evolve through the test of time. The Apple we see now is not the same as it was back in the 70s. Or, as opposed to that proprietary vendor, consider open source software. Take Linux, an open source operating system. First built as a personal project by a Finnish computer science student, Linus Torvalds, Linux is now a multi-billion dollar business. A technology that started as free for all has been transformed into a profit-oriented product. The surprising news is that the free-for-all Linux is still making its way alongside its commercial side. This product can change the way people see free stuff. The operating system itself is still free, but the service of building and maintaining Linux systems is highly commercial. See, that is evolution, and it involves data.

In the next article, we are going to talk about how to see correlations between parameters in a data set and how to build a model; in the last article, we will discuss a new profession called data scientist.

Part 2: Looking for correlation

We have discussed how data can change the form of a business or an organisation. All the changes that might happen are based on data forecasts. Now we are going to talk about the second reason why data is important in business. Instead of only looking at a time series chart, people also need to know what correlations can be drawn from it. Let us use an example.

A simple correlation case was presented by Bryant and Smith in Practical Data Analysis: Case Studies in Business, in 1995. They showed a data set containing measurements taken on dining parties in a restaurant by a single waiter. The variables include total bill ($), tip ($), gender of the bill payer, day of the week, and the tip as a percentage of the total bill. They wanted to see which variable or variables had the strongest influence on total tips in a week. They also compared the tips from male and female customers.

We can see in the chart, that total bill size and tip are positively associated (upper left scatter plot), but not as strongly as one might expect because there is increasing variability in tip as bill increases. Both tip and total bill have skewed distributions (upper left histograms), which might lead the analyst to consider log-transforming these variables.

Males spend more on average than females and bills are higher on the weekend (shown in the side-by-side box-plots). The 70% tip on a very small bill by a male on a Sunday may be an outlier. Much can be learned about tipping behaviour by studying this chart.


Fig 2 An example of correlation chart of multiple variables (from: National Library of Australia)

But we must take into account that correlation does not always mean causation. In the case above, there is indeed a correlation between the customer's gender and tipping behaviour. But what drives that behaviour has not been discussed yet. It generally lies underneath the numbers, and we have to dig it out.

Many studies are actually designed to test a correlation, but not a causation. In general, it is extremely difficult to establish causality between two correlated observations, but on the other hand, there are many statistical tools to establish a statistically significant correlation.
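One of the most common of those tools is the Pearson correlation coefficient, which measures how strongly two variables move together on a scale from -1 to 1. The sketch below computes it by hand; the bill and tip values are made up for illustration and are not taken from the Bryant and Smith data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up values: total bill and tip for five dining parties.
bills = [10.0, 20.0, 30.0, 40.0, 50.0]
tips = [2.0, 3.0, 4.5, 5.0, 8.0]

r = pearson_r(bills, tips)
print(f"correlation between bill and tip: {r:.3f}")
```

A value close to 1 says that larger bills go with larger tips. It says nothing about why, which is exactly the gap between correlation and causation.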

You would be surprised how often common-sense conclusions about cause and effect are wrong. That is because a correlation can arise simply from two things that frequently occur together. A correlation may also be observed when there is strong causality behind it; for example, it is well known that cigarette smoking not only correlates with lung cancer, but actually causes it. But the hardest part is that, in order to establish cause, we would have to rule out the possibility that smokers are more likely to live in urban areas, where there is more pollution, or any other possible explanation for the observed correlation.

So we can say that establishing causality can start from a series of correlations. But we have to add controlled variables to the analysis, as shown in the smoking example. We have to set the assumptions and narrow down the potential governing variables. We call the result a model.

In the 3rd and 4th parts of this “Data Talks” article, we are going to talk about “Model” and “Data Scientist”.

Part 3: How to build a model

The model is the most basic element of the scientific method. And business is just as close to science as physics is. Probably without noticing, we have already talked about models in the previous “Forecast” and “Decision Making” parts. Both rest on mathematical models in the form of equations. You probably know linear regression (see the following figure) or remember learning it in algebra. It is just one model among many that do the actual forecasting for us.


Fig 3 An example of linear regression model (from: National Library of Australia)

We also talked about a model when we saw the business loop diagram in the previous article, and we deal with models when we buy our children Hot Wheels or Barbie. Even a recipe is a model. So we could say a model is a simplification of what we are actually studying or trying to predict.

This is how we build a model:

  1. Data gathering. We talked about it in the forecast article. It can be a long time series, such as a rainfall data set, or data from a questionnaire.
  2. Setting the assumptions. Most models only work in a controlled environment, so we have to set the boundaries. The more boundaries, the narrower our model will be. How many boundaries should we have? Answering this can be an iterative process together with steps 3 and 4.
  3. Model fitting. This is the fun part. We can use major proprietary software like Stata, SPSS, and SAS, or choose a free one like R. These packages contain many model equations that we can pick and test later on.
  4. Model calibration. This part is also done automatically by software. Basically, we apply our chosen equation to new data. If the result behaves the same way as with the data we modelled in step 3, we can say our model is actually working. If not, we have to go back to step 3 or even step 2.
  5. Model application. This is the phase we like the most. But over time we have to re-evaluate our model against the current situation.
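Steps 3 and 4 can be sketched in a few lines for the simplest case, a straight line fitted by ordinary least squares. This is only an illustration and not what Stata, SPSS, SAS, or R run internally; all numbers are made up, and the "new data" stands for observations that were not used during fitting.

```python
def fit_line(xs, ys):
    """Step 3: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(xs, ys, slope, intercept):
    """Average distance between observed values and the fitted line."""
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Made-up data used to fit the model (step 3).
fit_x = [1, 2, 3, 4, 5]
fit_y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(fit_x, fit_y)

# Made-up new data used to check the model (step 4).
new_x = [6, 7]
new_y = [12.0, 13.9]

error_fit = mean_abs_error(fit_x, fit_y, slope, intercept)
error_new = mean_abs_error(new_x, new_y, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"error on fitted data={error_fit:.3f}, on new data={error_new:.3f}")
```

If the error on the new data is about as small as on the fitted data, the model behaves the same way on both and can move on to step 5; a much larger error sends us back to step 3 or step 2.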

Another thing we have to bear in mind is the Law of Simplicity. The simplest model has a higher chance of being accepted in a business environment. Top executives would probably care less about a model with 11 variables. A model with two or three variables is frequently chosen by a data scientist, previously known as a data analyst. In the 4th part of this “Data Talks” article, we are going to talk about the “Data Scientist”, a new blossoming career for mathematicians, statisticians, and computer scientists.

Part 4: Data scientist

It was not until about five years ago that people invented a new kind of profession, called “data scientist”. A data scientist represents an evolution from the business or data analyst role. The role typically requires a solid foundation in computer science and applications, modelling, statistics, analytics, and math. We are talking about a powerful career that can predict the future, talk about it, and persuade others. Good data scientists will not just address business problems; they will pick the right problems, those with the most value to the organisation.


Fig 4 A profile of a data scientist (from: EMC² web site)

The work of a data scientist would more or less cover the following aspects (extracted from a Coursera forum):

  • Formulate context-relevant questions and hypotheses to drive data scientific research
  • Identify, obtain, and transform a data set to make it suitable for the production of statistical evidence communicated in written form
  • Build models based on new data types, experimental design, and statistical inference

Aside from proficiency in computer science, math, and statistics, a good data scientist must have curiosity, creativity, focus, and attention to detail.

Data scientists are needed wherever data is involved in an operation. Companies that hire data scientists include:

  • Construction companies
  • Utility companies
  • Oil, gas and mining companies
  • Hospitals and health care organisations
  • Colleges and universities
  • Federal, provincial/state and municipal government departments
  • Transportation companies
  • Telecommunications companies
  • Insurance, finance and banking organisations
  • Management consulting companies
  • Manufacturing companies

As we conclude our talk on data, it’s clear that

Numbers are not just numbers

They can speak

And it’s up to us to listen