Becoming a data scientist
How do I become a data scientist?
Background: I recently finished my bachelor's degree in computer science at Berkeley. Although it may be a bit late, I am just now getting interested in learning more about statistics and "data science." Unfortunately, I don't have much of a math background (I only took up to Linear Algebra and the required probability/discrete math course for CS). Although I have started working, I have the option of enrolling in an MS CS program in January. What courses should I be looking at, and would an MS in Statistics be more useful? If so, is it possible to get into an MS in Statistics without a strong math background? I will probably be looking into taking machine learning and data visualization.
Alex Kamil
Strictly speaking, there is no such thing as "data science" (see What is data science?). See also: Vardi, Science has only two legs: http://portal.acm.org/ft_gateway...
Here are some resources I've collected about working with data. I hope you find them useful (note: I'm an undergrad student, this is not an expert opinion in any way).
1) Learn about matrix factorizations:
Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job: you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms/LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for special cases, build your own and scale them up to terabytes of data on a cluster of commodity machines.[6] Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites. I'd recommend these resources for self study/reference material:
- BellKor, Matrix factorization for recommender systems: www2.research.att.com/~volinsky/...
- BellKor, Scalable Collaborative Filtering..: public.research.att.com/~volinsk...
- Press et al., Numerical Recipes in C++: http://www.amazon.com/Numerical-...
- Golub & Van Loan, Matrix Computations: http://www.amazon.com/Computatio...
- Watkins, Fundamentals of Matrix Computations (this is a very gentle intro to the field): http://www.amazon.com/Fundamenta...
- Demmel, Applied Numerical Linear Algebra: http://www.amazon.com/Applied-Nu...
- Trefethen & Bau, Numerical Linear Algebra: http://www.amazon.com/Numerical-...
- Watkins, The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods: http://www.amazon.com/Matrix-Eig...
- Parlett, The Symmetric Eigenvalue Problem: http://www.amazon.com/Symmetric-...
- Iverson, Algebra as a language: http://www.jsoftware.com/papers/...
- Iverson, Algebra: an algorithmic treatment: http://www.amazon.com/Algebra-al...
- Bertsekas, Parallel and Distributed Computation: Numerical Methods: http://www.amazon.com/Parallel-D...
- Hamming, Numerical Methods for Scientists and Engineers: http://www.amazon.com/Numerical-...
- Bierman, Factorization Methods for Discrete Sequential Estimation: http://www.amazon.com/Factorizat...
- Wilkinson, The Algebraic Eigenvalue Problem: http://www.amazon.com/Algebraic-...
- Horn, Matrix Analysis: http://www.amazon.com/Matrix-Ana...
- Harville, Matrix Algebra from a Statistician's Perspective: http://www.amazon.com/gp/product...
- Fiedler, Special Matrices: http://www.amazon.com/Special-Ma...
- Higham, Accuracy and Stability of Numerical Algorithms: http://www.amazon.com/gp/product...
- Langville & Meyer, Google's PageRank and Beyond: http://www.amazon.com/Googles-Pa...
- Nielsen, PageRank tutorial: http://michaelnielsen.org/blog/u...
- Mannix, Numerical recipes in Hadoop: http://www.slideshare.net/jakema...
- Godsil, Algebraic Graph Theory: http://www.amazon.com/Algebraic-...
- Wheeler, On building a stupidly fast graph database: http://blog.directededge.com/200...
- http://numpy.scipy.org/
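To make the "you cannot just run eig()" point concrete, here is a minimal sketch of power iteration in Python. It is not taken from Mahout or LAPACK; the names `matvec` and `power_iteration` and the 2x2 test matrix are made up for illustration. The idea it shows is that an iterative eigensolver touches the matrix only through matrix-vector products, and that product is the part you could split row-wise across machines.

```python
import math
import random


def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def power_iteration(rows, iterations=200, seed=0):
    """Estimate the dominant eigenvalue and a unit eigenvector of a square matrix."""
    rng = random.Random(seed)
    # Random positive start vector (hypothetical choice, any non-degenerate start works)
    x = [rng.random() + 0.1 for _ in rows]
    for _ in range(iterations):
        y = matvec(rows, x)                      # the only step that reads the matrix
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]                # renormalize to unit length
    ax = matvec(rows, x)
    eigenvalue = sum(a * b for a, b in zip(ax, x))  # Rayleigh quotient for unit x
    return eigenvalue, x


# Small symmetric example matrix (illustrative data only)
A = [[2.0, 1.0], [1.0, 3.0]]
value, vector = power_iteration(A)
```

For this matrix the exact dominant eigenvalue is (5 + sqrt(5)) / 2, so `value` can be checked by hand. On a cluster, each worker would hold a block of rows and compute its slice of `matvec`; the normalization is a small reduction step.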
2) Start learning statistics by coding with R:
- Pick up some R manuals (see ...gling.com/som... and UCI Machine learning repository: http://archive.ics.uci.edu/ml/)
- Here is a good reference to get started with regression analysis: http://www.amazon.com/Analysis-R...
- Albert, Bayesian computation with R
- Spector, Data Manipulation with R
- Gries, Quantitative corpus linguistics with R: http://www.amazon.com/Quantitati...
- Duda & Hart, Pattern Classification: http://www.amazon.com/Pattern-Cl... , it is a classic book on statistical inference and a very readable intro to the field
- Go through the Exploratory Data Analysis by Tukey: http://www.amazon.com/Explorator.... Read Hamming for inspiration: http://www.cs.virginia.edu/~robi...
- If you want to get a job, look up "statistician" or "data scientist" job specs on Twitter and see what the market wants: http://twitter.com/#search?q=sta..., http://twitter.com/#search?q=%22...
- E.g. here is Netflix's definition of the "data scientist" body of knowledge (http://jobs.netflix.com/DetailFl...): Multivariate Regression, Logistic Regression, Support Vector Machines, Bagging, Boosting, Decision Trees, Time Series Analysis, Optimization, Stochastic Processes, Experiment Analysis, Bootstrapping, R, SAS, Python, Weka, SQL and Excel. This looks like a standard Statistics curriculum.
- According to a LinkedIn job posting (http://www.sanfranrecruiter.com/...) you need to know some of the following: algorithm design, information retrieval, relational databases (SQL) and non-relational databases (Hadoop/pig), big data analytics, data classification, text mining, search algorithms. This seems to be more of a CS/IR oriented role.
- Learn about Palantir (http://www.palantirtech.com/), Recorded Future (https://www.recordedfuture.com/) and Lyric Semiconductor (http://www.lyricsemiconductor.com/), they make interesting products.
- Subscribe to DBWorld (it's a bit noisy but worth following): http://www.cs.wisc.edu/dbworld/. Consider joining at least one of these interest groups: http://www.sigkdd.org/, http://www.sigir.org/, http://www.sigmod.org/, http://www.sigsam.org, http://www.amstat.org/, http://www.siam.org/
- Choose an interesting problem to tackle, say temporal search: http://www.google.com/search?q=t...
- See what interests you more, do your market research. Would you prefer working with vendor tools and doing mostly modeling and reporting, or building data mining systems yourself and writing a lot of code? Do you see yourself as a corporate employee, a researcher in academia or a startup founder in the future? What data interests you? Structure your curriculum based on that.
3) Learn about distributed systems and databases:
- Note: this topic is not part of a standard Machine Learning track but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. I believe it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data. It is also becoming increasingly important to be able to utilize the full power of multicore (see http://en.wikipedia.org/wiki/Moo... , http://techresearch.intel.com/ar...).
- Download Hadoop [8] and run some MapReduce jobs on your laptop in pseudo-distributed mode (see ...)
- Learn about the Google technology stack (MapReduce, BigTable, Dremel, Pregel, GFS, Chubby, Protobuf etc). (See ... also http://research.google.com/pubs/... and http://www.umiacs.umd.edu/~jimmy..., http://www.columbia.edu/~ak2834/...)
- Set up an account with Amazon AWS/EC2/S3/EBS and experiment with running Hadoop on a cluster with large data sets (you can use Cloudera or YDN images, but in my opinion you can better understand the system if you set it up from scratch, using the original distribution). Watch the costs.
- Try out Hadoop alternatives, specifically the minimalist frameworks such as BashReduce: http://github.com/erikfrey/bashr... and CloudMapReduce: http://code.google.com/p/cloudma... (see ...)
- Run Brian Cooper's Cloud Serving Benchmark on AWS, compare Hbase vs Cassandra performance on a small cluster (6-8 nodes): http://wiki.github.com/brianfran...
- Run the LINPACK benchmark: http://www.datawrangling.com/on-...
- Run some experiments with MPI (http://www.mcs.anl.gov/research/...), try to implement a simple clustering algorithm (e.g. http://en.wikipedia.org/wiki/K-m...) with MPI vs Hadoop/MapReduce and compare the performance, fault tolerance, ease of use etc. Learn the differences between the two approaches, and when it makes sense to use each one.
- Check out Dongarra's papers: http://www.netlib.org/utk/people...
- There is a new library called MPI-Mapreduce (http://www.sandia.gov/~sjplimp/m...), see how it works and how it compares to other MapReduce implementations
- Run some tests with Scalapack [5], try to port one of the routines to Hadoop, compare the performance and scalability
- Write your own simplified MapReduce runtime in C or any other programming language
- Check out http://www.cascading.org/, http://clojure.org/ and http://github.com/bradford/infer
- Learn about distributed hash tables (http://en.wikipedia.org/wiki/Dis...)
- Learn about Paxos (http://en.wikipedia.org/wiki/Pax...), run some experiments with open source implementations.
- Download Nutch (http://nutch.apache.org/) or Solr (http://lucene.apache.org/solr/), run a crawl on Wikipedia. Analyze the collected data with R (see item 2 above) or Python (http://www.nltk.org/)
- Write your own simplified crawler/indexer, test the performance and scalability, look at the Lucene source for ideas, look at http://infolab.stanford.edu/~bac... for inspiration. You can probably build it as a term project in either an Information Retrieval or a Search Engines course.
- Learn about prefix-sum: http://en.wikipedia.org/wiki/Pre... , parallel matrix multiplication: http://www.cs.berkeley.edu/~yeli... , streaming: http://infolab.stanford.edu/stream/ and BSP: http://en.wikipedia.org/wiki/Bul...
- Pick one of the PGAS languages (http://en.wikipedia.org/wiki/Par...), e.g. X10 (http://en.wikipedia.org/wiki/X10...), go through the tutorials (http://ppppcourse.ning.com/forum...), run some HPC benchmarks (LU, FFT) and the examples (the streaming example in particular): see how it scales on a cluster/AWS, compare to sequential and Hadoop/MapReduce implementations, see what kind of performance/scalability gains it gives you on multicore boxes.
- Some good references on parallel programming: Herlihy & Shavit, The art of multiprocessor programming: http://www.amazon.com/Art-Multip... , Blelloch, Vector models for data-parallel computing: http://citeseerx.ist.psu.edu/vie... , Valiant, A bridging model for parallel computation: http://portal.acm.org/citation.c... , Hillis & Steele, Data Parallel Algorithms: http://portal.acm.org/citation.c...
- Take a course in Parallel Computer Architecture: http://www.eecs.berkeley.edu/~cu...
- Check out Cilk: http://software.intel.com/en-us/...
- Run some experiments with Weka (http://www.cs.waikato.ac.nz/ml/w...) or RapidMiner (http://rapid-i.com/), pick a simple algorithm and port it to MapReduce, see how it scales on a cluster/AWS
- Experiment with distributed 'NoSQL' data stores (Voldemort, Hbase, Redis, Tokyo, Cassandra etc). Figure out what the CAP theorem is all about (http://www.allthingsdistributed....). Create a simple app with a key-value or column-based store as a back-end. Import several GBs of interesting data into it and run some simple clustering/KNN algos (http://en.wikipedia.org/wiki/Clu..., http://en.wikipedia.org/wiki/Nea...). Optimize your algo to better utilize random access patterns, experiment with various tuning options. Build a front-end visualization for the results (check out Protovis or a similar visualization package: http://vis.stanford.edu/protovis/)
- A good resource on 'NoSQL': Varley, No Relation: The Mixed Blessings of Non-Relational Databases: http://ianvarley.com/UT/MR/Varle...
- Learn about main-memory databases: http://en.wikipedia.org/wiki/In-... , http://scholar.google.com/schola..., http://monetdb.cwi.nl/
- Write a distributed hash table in C, here is a good reference: http://pdos.csail.mit.edu/papers...
- Write a distributed file system in C. Learn how to write good systems code using the following resources:
http://herpolhode.com/rob/
http://www.cs.princeton.edu/~bwk/
http://cm.bell-labs.com/who/dmr/
http://www.cs.columbia.edu/~aho/
http://plan9.bell-labs.com/who/ken/
http://www.informatik.uni-trier....
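As a starting point for the "write your own simplified MapReduce runtime" exercise in the list above, here is a rough single-process sketch in Python. It is not Hadoop's API: `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names, and the two input lines are made-up data. It only shows the three phases (map, shuffle by key, reduce), leaving out everything that makes the real thing hard (partitioning across machines, spilling to disk, fault tolerance).

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run the map, shuffle and reduce phases in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map: emit (key, value) pairs
            groups[key].append(value)          # shuffle: group values by key
    # reduce: collapse each key's list of values into one result
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


# Illustrative input only
lines = ["big data is big", "data about data"]
counts = map_reduce(lines, word_mapper, count_reducer)
```

A natural next step for the exercise would be to hash keys into partitions and run each partition's reduce in a separate process, then compare the behaviour with a Hadoop job on the same input.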
4) Learn about data compression
To be added
5) Learn about machine learning
- This is an excellent resource for self-study: Cross, Learning about machine learning: http://measuringmeasures.com/blo... , also http://metaoptimize.com/qa/quest...
- The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting.
- Since all the standard machine learning, data mining, IR, statistics, AI, NLP content is available online, can be forked on github or purchased on Amazon, I personally don't see much value in studying for a Masters degree unless you want a corporate job afterwards.
- See: Was your Master's in Computer Science (MS CS) degree worth it and why? , When is it a good idea to get an MS in Computer Science? , Was your Master's degree in Statistics/Applied Math/Symbolic systems worth it and why? , What are the advantages and disadvantages of doing a CS PhD?
- [Higher Education] Which are the best universities for an MS or PhD related to Information Retrieval, and why?
- See Lorica, How to nurture data scientists: http://practicalquant.blogspot.c...
- You can structure your study program according to online course catalogs and curricula of MIT (http://web.mit.edu/catalog/degre..., http://ocw.mit.edu/courses/elect...), Stanford (http://www.stanford.edu/dept/reg...) or other top engineering schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage (http://www.ngoprekweb.com/2006/1...).
- Joining a well-capitalized data-driven startup and learning by doing (with some part-time self-study using the resources above) could be a good option. See:
Who are the best VCs in the field of analytics / data mining / databases?
Which companies have the best data science teams?
What are the notable startups in the news space?
Does the US Census have a data team?
Why do so many data geeks join web companies instead of solving large scale data problems in biology?
6) Learn about least-squares estimation and Kalman filters:
- This is a classic topic and "data science" par excellence in my opinion. It is also a good introduction to optimization and control theory. Start with Bierman's LLS tutorial given to his colleagues at JPL, it is clearly written and is inspiring (the Apollo mission trajectory was estimated using these methods): http://www.amazon.com/Factorizat... , also see Curkendall & Leondes: http://adsabs.harvard.edu/full/1974CeMec...8..481C and Quarles: http://citeseerx.ist.psu.edu/vie....
- See Steven Kay's series on statistical signal estimation: http://www.amazon.com/Fundamenta..., also check out his short course outline at University of Rhode Island for a list of interesting topics to learn (this is usually part of EE curricula): http://www.ele.uri.edu/faculty/k...
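To see how least squares and Kalman filtering connect, here is a minimal scalar sketch in Python. It is not Bierman's factorized formulation; `kalman_constant` and the readings are hypothetical and chosen only for illustration. It estimates a constant from noisy measurements, assuming no process noise. Under that assumption, and with a very vague prior, the filter reduces to recursive least squares, so the estimate is simply the running mean of the measurements.

```python
def kalman_constant(measurements, meas_var, prior_mean=0.0, prior_var=1e12):
    """Scalar Kalman filter for a constant state observed with additive noise.

    Assumes zero process noise; prior_var is set huge to mimic "no prior knowledge".
    """
    x, p = prior_mean, prior_var
    for z in measurements:
        k = p / (p + meas_var)            # Kalman gain
        x = x + k * (z - x)               # correct the estimate with the innovation
        p = p * meas_var / (p + meas_var) # same as (1 - k) * p, but numerically safer
    return x, p


# Made-up sensor readings around 10, each with noise variance 0.25
readings = [10.2, 9.7, 10.1, 9.9, 10.3]
estimate, variance = kalman_constant(readings, meas_var=0.25)
```

Here `estimate` agrees with the sample mean of the readings and `variance` with `meas_var / len(readings)`, up to the tiny influence of the prior. Adding a process-noise term to `p` before each update is what turns this into a tracker for a slowly changing state.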
7) Check out these Q&A:
What are the best blogs about data?
What are the best Twitter accounts about data?
What are the best blogs about bioinformatics?
What are the best Twitter accounts about bioinformatics?
What is data science?
What are the best courses at MIT?
What are the best resources to learn about web crawling and scraping?
What are the best interview questions to evaluate a machine learning researcher?
What are the best resources for learning about distributed file systems?
What are some useful packages for working with large datasets in R?
What are some good books on stringology and pattern matching?
What's a good introductory machine learning text?
What is the best book to pick up working knowledge of theoretical statistics (assuming strong general math)?
Can anyone recommend a fantastic book on time series analysis?
What are the standard texts on linear regression?
What are some good books on random processes?
How has BigTable evolved since the 2006 Google paper?
What is a good source for learning about Bayesian networks?
What are the best data visualizations ever created?
What are some of the prediction and risk estimation models used by insurance companies?
How do scientists share data?
What are the best quant hedge funds?
What are the best books on econometrics?
What are the best introductory books on mathematical finance?
What is the best approach for text categorization?
What are the numbers that every engineer should know, according to Jeff Dean?
If you do decide to go for a Masters degree:
8) Study Engineering - I'd go for CS with a focus on either IR or Machine Learning or a combination of both, and take some systems courses along the way. As a "data scientist" you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis etc., not how to build systems. I think the latter is more urgently needed these days, as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 2 above) or take some statistics classes as a part of your CS studies.
Good luck.
[1] http://mahout.apache.org/
[2] http://www.netlib.org/lapack/
[3] http://www.netlib.org/eispack/
[4] http://math.nist.gov/javanumeric...
[5] http://www.netlib.org/scalapack/
[6] http://labs.google.com/papers/ma...
[7] http://www.r-project.org/
[8] http://hadoop.apache.org/
Peter Skomoroch, Sr. Data Scientist @ Linkedin - ...
If you have the time to take courses, give it a shot.
1) Try to take some of the undergrad math courses you missed. Linear Algebra, Advanced Calculus, Diff. Eq., Probability, Statistics are the most important. After that, take some Machine Learning courses. Read a few of the leading ML textbooks and keep up with journals to get a good sense of the field.
2) Read up on what the top data companies are doing. After 1 or 2 machine learning courses you should have enough background to follow most of the academic papers. Implement some of these algorithms on real data.
3) If you are working with large datasets, get familiar with the latest techniques & tools (Hadoop, NoSQL, R, etc.) by putting them into practice at work (or outside of work).
Read these posts by Mike Driscoll:
* http://dataspora.com/blog/the-se...
* http://dataspora.com/blog/sexy-d...
Joseph Misiti
I am currently working as a data engineer with a team of others and I can tell you what we all have in common:
1) MS or PhDs in Applied Mathematics or Electrical Engineering
2) Fluency in C++/Matlab/Python
3) Experience building distributed systems and algorithms.
I agree with Anon that CS is probably not the way to go unless you are going to MIT, Caltech, Stanford, CMU, etc. The way I ended up in the field was working as a software engineer designing real-time systems and getting an MS in Applied Math part-time. After 4 years I had skills from both fields and was offered a position doing ML/DM. With that said, I can tell you that it's an extremely interesting field, and it appears the skill set will only become more desirable in the future.
Gregory Piatetsky, analytics/data mining consultant...
A good start for becoming a data scientist is to get an MS (or PhD) in Machine Learning / Data Mining - along the way you will get plenty of experience in the relevant math and use the latest systems. Stanford, UCI, CMU, MIT are top schools, but there are many others in the USA - see
http://www.kdnuggets.com/educati... and in Europe
http://www.kdnuggets.com/educati...
Stanford has online courses in data mining / ML - check
http://www.kdnuggets.com/2010/06...
http://scpd.stanford.edu/
Russell Jurney, Data Viznik, Hack Historian
The school route is well covered. This is the autodidactic route:
Look at some common problems solved with machine learning. Look at problems in your areas of interest with an abundance of available data. Intersect these sets, pick a problem to solve with ML. Learn whatever it takes to solve it poorly. Get people using the output of your model. Iterate, learn more techniques. Work on your maths as needed. Find mentors to talk with about problems you're working on. Keep them updated, collaborate, learn from them.
Get good at building things with data. Update your LinkedIn profile - congratulations, you're a data scientist!
Paco Nathan, 45 years ago I couldn't even spe...
Stanford has an interdisciplinary degree specifically for data science, called Mathematical and Computational Sciences (MCS). It's sponsored by the Stats department and overlaps with CS, Math, Operations Research, etc.
http://www.stanford.edu/group/ma... The BS degree dovetails particularly well with a co-term program to get an MS in Computer Science -- say, with a distributed systems specialization.
+1 to both Pete's and Russ' wise words above.
Yaniv Goldenrand, Fraud and credit modeling
Get a job doing it, this way you'll learn what really matters and get paid in the process.
The standard way to become a data analyst is master's in math/statistics + internship.
Other ways are:
- PhD in some empirical subject (economics, psychology).
- Get an engineering position in some data-intensive company and convert.
Some of the best modelers I know are ex-programmers.
Sandro Saitta
Reading data mining related blogs is also important to understand the wide application areas of data mining. You have a list of data mining blogs here:
http://www.dataminingblog.com/li...
Reposted from: http://www.cnblogs.com/sxfmol/archive/2010/09/27/1836806
2019-03-27 00:48
Create a Bootable MicroSD Card
http://gumstix.org/create-a-bootable-microsd-card.html Create a Bootable MicroSD Card Beginners Note: The following instructions are intended for experienced Gumstix us
TMF大数据分析指南 Unleashing Business Value in Big Data(一)
大数据分析指南 TMF Frameworx最佳实践 Unleashing Business Value in Big Data 前言 此文节选自TMF Big Data Analytics Guidebook。 TMF文档版权信息 Copyright © TeleManagement Forum 2013. All Rights Reserved. This docume
最新教程
更多java线程状态详解(6种)
java线程类为:java.lang.Thread,其实现java.lang.Runnable接口。 线程在运行过程中有6种状态,分别如下: NEW:初始状态,线程被构建,但是还没有调用start()方法 RUNNABLE:运行状态,Java线程将操作系统中的就绪和运行两种状态统称为“运行状态” BLOCK:阻塞状态,表示线程阻塞
redis从库只读设置-redis集群管理
默认情况下redis数据库充当slave角色时是只读的不能进行写操作,如果写入,会提示以下错误:READONLY You can't write against a read only slave. 127.0.0.1:6382> set k3 111 (error) READONLY You can't write against a read only slave. 如果你要开启从库
Netty环境配置
netty是一个java事件驱动的网络通信框架,也就是一个jar包,只要在项目里引用即可。
Netty基于流的传输处理
在TCP/IP的基于流的传输中,接收的数据被存储到套接字接收缓冲器中。不幸的是,基于流的传输的缓冲器不是分组的队列,而是字节的队列。 这意味着,即使将两个消息作为两个独立的数据包发送,操作系统也不会将它们视为两个消息,而只是一组字节(有点悲剧)。 因此,不能保证读的是您在远程定入的行数据
Netty入门实例-使用POJO代替ByteBuf
使用TIME协议的客户端和服务器示例,让它们使用POJO来代替原来的ByteBuf。
Netty入门实例-时间服务器
Netty中服务器和客户端之间最大的和唯一的区别是使用了不同的Bootstrap和Channel实现
Netty入门实例-编写服务器端程序
channelRead()处理程序方法实现如下
Netty开发环境配置
最新版本的Netty 4.x和JDK 1.6及更高版本
电商平台数据库设计
电商平台数据库表设计:商品分类表、商品信息表、品牌表、商品属性表、商品属性扩展表、规格表、规格扩展表
HttpClient 上传文件
我们使用MultipartEntityBuilder创建一个HttpEntity。 当创建构建器时,添加一个二进制体 - 包含将要上传的文件以及一个文本正文。 接下来,使用RequestBuilder创建一个HTTP请求,并分配先前创建的HttpEntity。
MongoDB常用命令
查看当前使用的数据库 > db test 切换数据库 > use foobar switched to db foobar 插入文档 > post={"title":"领悟书生","content":"这是一个分享教程的网站","date":new
快速了解MongoDB【基本概念与体系结构】
什么是MongoDB MongoDB is a general purpose, document-based, distributed database built for modern application developers and for the cloud era. MongoDB是一个基于分布式文件存储的数据库。由C++语言编写。旨在为WEB应用提供可扩展的高性能数据存储解决方案。
windows系统安装MongoDB
安装 下载MongoDB的安装包:mongodb-win32-x86_64-2008plus-ssl-3.2.10-signed.msi,按照提示步骤安装即可。 安装完成后,软件会安装在C:\Program Files\MongoDB 目录中 我们要启动的服务程序就是C:\Program Files\MongoDB\Server\3.2\bin目录下的mongod.exe,为了方便我们每次启动,我
Spring boot整合MyBatis-Plus 之二:增删改查
基于上一篇springboot整合MyBatis-Plus之后,实现简单的增删改查 创建实体类 添加表注解TableName和主键注解TableId import com.baomidou.mybatisplus.annotations.TableId; import com.baomidou.mybatisplus.annotations.TableName; import com.baom
分布式ID生成器【snowflake雪花算法】
基于snowflake雪花算法分布式ID生成器 snowflake雪花算法分布式ID生成器几大特点: 41bit的时间戳可以支持该算法使用到2082年 10bit的工作机器id可以支持1024台机器 序列号支持1毫秒产生4096个自增序列id 整体上按照时间自增排序 整个分布式系统内不会产生ID碰撞 每秒能够产生26万ID左右 Twitter的 Snowflake分布式ID生成器的JAVA实现方案