To GitHub or not to GitHub?


I have wanted to write about open source for a while. Then I saw, in several bioinformatics job postings, experience with open source software listed among the desired skills — something I had never seen before. Then I bumped into an article in the print edition of Computerworld, “Saying Yes to Open Source,” and learned the following: Netflix is built on open source — hence its low-priced service; Kaiser Permanente has been using GitHub since 2011.

Well, what is open source? Wikipedia says: “open source is a development model that promotes a) universal access via free license to a product’s design or blueprint, and b) universal redistribution of that design or blueprint, including subsequent improvements to it by anyone,” and adds that Wikipedia itself is an example of applying open source outside the realm of computer software. Within software, the best-known example is probably Red Hat Enterprise Linux; Unix, which I, for example, use on my MacBook, is not open source. RStudio is an open source IDE for the R programming language. Eclipse, one of my favorite IDEs for whatever language, is also open source, allowing anyone to edit existing plug-ins and write new ones. A lot of commonly used bioinformatics software is open source: the Jmol and RasMol viewers, BioPython, BioJava, Bioconductor (and other R packages on CRAN), Galaxy — a web-based workflow system for analyzing genomics data and integrating data from the UCSC Genome Browser with user data — and many others.

On the other side of this free open source paradise are the code-hosting web repositories. And while many of them have been set up in the last 15 years, only the oldest, SourceForge (which hosts more than 430,000 projects and has more than 3.7 million registered users, according to Wikipedia), and the youngest, GitHub (which, with over 11.7 million repositories, is the largest code host in the world), are the two everyone can name on the spot.

GitHub seems to be the one to go with, for now. I first encountered GitHub last summer, when my summer high school students refused to use SVN and sent me a link to GitHub instead. Now they have something to show to the world :). We haven’t embraced GitHub at work yet, though we are considering using it to showcase our code (it is not free for private development). According to some posts, it is worthwhile to pay and transition to GitHub for private development as well.

The platform of tomorrow is the Web


While some still argue whether the Chromebook is a legitimate laptop, Auberge Resorts (a chain of seven luxury hotels) is migrating to Chromebooks running Google’s cloud-based services, reports Computerworld. It is a bold move, considering that Chrome OS-based laptops accounted for just 1% of the world PC market in 2013. The main reason? The price: a Chromebook can be bought for as little as $200 — a great deal for those who are mostly MS Office users. And Microsoft is going to offer its free Office Online apps — Word, Excel, PowerPoint and OneNote — through Google’s Chrome Web Store.

Companies to watch: DNAnexus


This is an excerpt from the job ad at DNAnexus:

At DNAnexus we are solving the most challenging computer science problems you’re ever likely to see. 
In the last few years, there has been a dramatic development in the world of genomics that has created a huge new opportunity. The price to sequence the full human genome (all of your DNA, not just a sample of it) has fallen to the point where it will soon be affordable for a patient to have multiple samples of their whole genome sequenced to help treat their disease. Want to know what specific gene mutation caused a patient’s cancer? We are building the platform to answer that kind of question. One of the many challenges is the huge amount of data. Think you’ve seen big-data problems? Think again – with each genome comprising 100 GB and months of CPU time to crunch the information, DNA is the next big-data problem, requiring exabytes of storage and parallel workloads distributed across 100,000 servers. We are tackling this by combining web technologies, big-data analytics, and scalable systems on cloud computing infrastructure.

We are a well-funded start-up backed by Google Ventures, TPG Biotech, and First Round Capital. Our founders, Andreas Sundquist, Arend Sidow, and Serafim Batzoglou, are world-renowned genomics and bioinformatics experts from Stanford University.

We are looking for smart, motivated people. Ideal candidates will likely know several of the following technologies: C, C++, Boost, Ruby, Rails, HAML, HTML, CSS, JavaScript, ECMAScript, V8, jQuery, Flash, Flex/ActionScript, Node.js, Python, Perl, PHP, Amazon Web Services (AWS), SQL, MySQL, PostgreSQL, MongoDB, Solr, Sphinx, Hadoop, MapReduce, ZooKeeper, Hive, Pig, Oozie, HDFS, ZFS, MCache.
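The scale the ad describes can be sanity-checked with back-of-the-envelope arithmetic (the arithmetic below is my own illustration, using only the 100 GB per genome figure quoted in the ad):

```python
# Back-of-the-envelope check of the "exabytes of storage" claim in the ad.
# Assumption (from the ad): one whole genome comprises about 100 GB of data.

GB = 10**9   # bytes in a gigabyte (decimal)
EB = 10**18  # bytes in an exabyte

genome_size = 100 * GB                   # bytes per whole genome
genomes_per_exabyte = EB // genome_size  # genomes that fill one exabyte

print(genomes_per_exabyte)  # 10000000, i.e. ~10 million genomes per exabyte
```

So "exabytes of storage" corresponds to sequencing on the order of tens of millions of whole genomes, which is exactly the population-scale future the ad is betting on.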

And some more about the company from the company’s website:

No proposal is considered too large, as demonstrated through the CHARGE project, a collaboration between Baylor College of Medicine’s Human Genome Sequencing Center (HGSC), DNAnexus, and Amazon Web Services that enabled the largest genomic analysis project to have ever taken place in the cloud. We worked with HGSC to deploy its Mercury variant-calling pipeline on the DNAnexus cloud platform, processing 3,751 whole genomes and 10,940 exomes using 3.3 million core-hours with a peak of 20,800 cores and making Mercury available to 300 researchers participating in the global CHARGE consortium.
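The CHARGE numbers quoted above are internally consistent, as a quick check shows (my arithmetic, not DNAnexus’s):

```python
# Rough consistency check of the CHARGE figures: 3.3 million core-hours
# with a peak of 20,800 cores. If the whole workload had run at peak
# concurrency, how long would it have taken?

core_hours = 3.3e6   # total compute quoted
peak_cores = 20_800  # peak concurrency quoted

hours_at_peak = core_hours / peak_cores  # wall-clock time if always at peak
days_at_peak = hours_at_peak / 24

print(round(hours_at_peak), "hours, or about", round(days_at_peak, 1), "days at peak")
```

In other words, even running flat-out at peak, nearly 14,000 samples took the equivalent of almost a week of continuous computing on over twenty thousand cores.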

Here is the service model and pricing DNAnexus provides, effective May 2014:



20 cool infographics on healthcare

Here are 20 infographics related to healthcare, all made in 2012. Yes, two years old. Yet some of them are pretty cool in terms of design and statistics that can be borrowed for presentations, and some are designed to stay relevant in the long run. In any case, I found that gallery of designs and numbers inspiring.

I liked #5 — HOW BIG DATA FLOWS IN HEALTHCARE — it even shows a table explaining what a petabyte and an exabyte are (handy 🙂) and where data are generated, how they flow, and where they can potentially be used.

#2 — The Future of Healthcare Technology Over A 30 Year Span — will be interesting to revisit in a few years.

#4 — How Patients Learn in the Digital Age, from HealthEdAcademy — contains a curious phrase: “Healthcare extenders becoming the search engines for patients.” 59% of healthcare extenders say that patients bring information from the Internet to discuss, and 30% say that patients cannot find reputable information on the Internet.

#6 provides data from a survey of 3,015 practicing U.S. physicians. 62% of them owned a tablet, and half of the tablet owners used it at the point of care. On average, physicians spent 11 hours per week online for professional purposes, and oncologists 17 hours.

#9 lists the top 20 most popular EMR solutions and the major players in this market in 2013: eClinicalWorks, McKesson, Cerner, Epic, Allscripts, etc.

#15 reports that “more than 40% of consumers say that information found via social media affects the way they deal with their health.”

#17 gives promising numbers on the use of the cloud in healthcare.

I got Stat Learning Certificate from Stanford!


That was the best MOOC I have taken so far — fun and engaging. Statistics and R from Trevor and Rob and Stanford — a dream come true!

Humans can discriminate 1+ trillion scents

C. Bushdid et al. Science


This is far more than previous estimates of distinguishable olfactory stimuli, which stood at about 10,000 odors, reports this Science paper. The approach is fascinating in itself, as the Science editor summarized:

“Because the authors reduced the complexity by investigating only mixtures of 10, 20, or 30 components drawn from a collection of 128 odorous molecules, this astonishingly large number is probably the lower limit of the potential number of olfactory stimuli that humans can distinguish.”
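The enormous number of stimuli follows from simple combinatorics; here is a sketch of the count of possible mixtures implied by the study’s design (my own illustration, not the paper’s actual statistical analysis):

```python
from math import comb

# Number of distinct mixtures of k components drawn from the study's
# collection of 128 odorous molecules (an illustration of the design,
# not the paper's discrimination statistics).
components = 128
for k in (10, 20, 30):
    print(k, "components:", comb(components, k), "possible mixtures")

# Even the 30-component mixtures alone vastly exceed a trillion:
print(comb(128, 30) > 10**12)  # True
```

So the space of possible test mixtures dwarfs a trillion even before considering component concentrations, which is why the trillion figure is a lower bound.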

I have always been fascinated by smells, as I like perfume and have a good olfactory memory. Smells, or odorants (odorous molecules), are detected by binding to specialized olfactory receptor neurons lining the nose in the olfactory epithelium, which is about 10 cm² in humans and contains millions of neurons. Each neuron expresses only one functional odor receptor, encoded by a single gene; humans have 350 such genes.

“Odor sensation depends on the concentration (number of molecules) available to the olfactory receptors. A single odorant stimulus type is typically recognized by multiple receptors, and different odorants are recognized by combinations of receptors, the patterns of neuron signals helping to identify the smell. The olfactory system does not interpret a single compound, but instead the whole odorous mix, not necessarily corresponding to concentration or intensity of any single constituent.” (Wikipedia) There are different theories of how exactly the stimulation happens.

Out of the 117 known chemical elements, only 87 can form chemical compounds, whose estimated number ranges from 10^18 to 10^200 (for comparison, the number of grains of sand on Earth is about 7.5 × 10^18, and the number of particles in the universe is between 10^72 and 10^87) (the numbers are from wiseGEEK). Aromatic compounds are stable and abundant in both natural and synthetic forms. But how many of them are known? It is surprisingly difficult to get a complete picture by searching Google.


Competition to design a robot delivering a TED talk


The X Prize Foundation announced a competition to design a robot that delivers a TED talk. Everyone is welcome to propose rules for the competition, which will tentatively give a robot 30 minutes to prepare a 3-minute talk on one of 100 TED talk subjects.

There is skepticism about the competition. As New Scientist writes:

“Computer scientist Ryan Adams at Harvard University says that such a set-up would reveal little about AI. ‘Intelligence involves adapting, learning about the structure of the world, making decisions under uncertainty and achieving objectives over time,’ he says. ‘Giving a talk and then “answering questions” doesn’t tell us anything about any of these issues.’”

I am skeptical too. At first I naively thought about what kind of data, besides the TED API (which provides access to more than 1,000 talks, plus TED Quotes, tags, themes, ratings and more), could be pulled for the task. But after contemplating TED talks (and I watch them pretty regularly) and how often an exciting talk delivers very little actual information (just try, once in a while, to read the transcripts of your favorite talks), I decided that making a 3-minute TED presentation is not that much of a challenge, considering that neither you nor your robot competitors will have any credentials.

The recipe might be simple: tell an interesting, emotional story to connect with the audience, ask them to raise their hands or stand up, throw in some bits of data to feed their brains, and finish by returning to your story and concluding with a lesson or two. The audience will be delighted. Do you need any AI for that? I doubt it. Just upload 100 stories and a nice speech synthesis voice, add some facts from Wikipedia and, with an Internet connection on the spot, grab something the audience hasn’t yet read on the day of the presentation. Appearance matters too.
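The recipe is almost mechanical enough to write down. A toy sketch of how such a talk generator might assemble a script (all the story, fact, and lesson content below is hypothetical placeholder data of mine, not pulled from the TED API):

```python
import random

# Toy sketch of the "recipe": emotional hook, audience interaction,
# a data point, and a closing moral. All strings are made-up placeholders.
stories = ["When I was twelve, my grandmother lost her sense of smell..."]
facts = ["Humans can discriminate over a trillion different scents."]
lessons = ["Pay attention to the small signals around you."]

def assemble_talk(rng: random.Random) -> str:
    """Assemble a four-part talk script from the canned material."""
    story = rng.choice(stories)
    fact = rng.choice(facts)
    lesson = rng.choice(lessons)
    return "\n".join([
        story,                                          # emotional hook
        "Raise your hand if this sounds familiar.",     # audience interaction
        "Here is a fact: " + fact,                      # data for the brains
        "Which brings me back to my story. " + lesson,  # closing moral
    ])

print(assemble_talk(random.Random(0)))
```

Scale the lists up to 100 stories and a feed of fresh facts, and the "AI" in this robot speaker is mostly retrieval and templating.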