第1章 欢迎来到未来
The shovel is a tool, and so is a bulldozer.
铲子是一种工具,推土机也是如此。
Neither works on its own, “automating” the task of digging.
它们都不能自主工作,比如“自动化”完成挖掘任务。
But both tools augment our ability to dig.
但这两种工具都增强了我们的挖掘能力。
Dr. Douglas Engelbart, “Improving Our Ability to Improve”1
Douglas Engelbart博士,“提高我们的改进能力”
Marketing is about to get weird.
营销即将变得有意思。
We’ve become used to an ever-increasing rate of change.
我们已经习惯了持续的改变。
But occasionally, we have to catch our breath, take a new sighting, and reset our course.
但偶尔,我们也应该屏住呼吸,审视一下,重新规划我们的路线。
Between the time my grandfather was born in 1899 and his seventh birthday:
在我祖父1899年出生后和他七岁生日之间:
Theodore Roosevelt took over as president from William McKinley.
西奥多罗斯福接任威廉麦金利成为美国总统。
Dr. Henry A. Rowland of Johns Hopkins University announced a theory about the cause of the Earth’s magnetism.
约翰斯·霍普金斯大学的Henry A. Rowland博士宣布了一个引发地球磁性原因的理论。
L. Frank Baum’s The Wonderful Wizard of Oz was published in Chicago.
L. Frank Baum的《绿野仙踪》(The Wonderful Wizard of Oz)在芝加哥出版。
The first zeppelin flight was carried out over Lake Constance near Friedrichshafen, Germany.
第一架齐柏林飞艇在德国腓特烈港附近的康斯坦茨湖上空飞行。
Karl Landsteiner developed a system of blood typing.
Karl Landsteiner发明了一种血型分类方法。
The Ford Motor Company produced its first car—the Ford Model A.
福特汽车公司生产了他们的第一辆汽车 – 福特A型轿车。
Thomas Edison invented the nickel-alkaline storage battery.
托马斯爱迪生发明了镍碱蓄电池。
The first electric typewriter was invented by George Canfield Blickensderfer of Erie, Pennsylvania.
宾夕法尼亚州伊利市的George Canfield Blickensderfer发明了第一台电动打字机。
The first radio that successfully received a radio transmission was developed by Guglielmo Marconi.
Guglielmo Marconi发明了第一台可以接无线信号的无线电。
The Wright brothers flew at Kitty Hawk.
莱特兄弟在基蒂霍克(Kitty Hawk)成功试飞。
The Panama Canal was under construction.
巴拿马运河还在建设当中。
Benjamin Holt invented one of the first practical continuous tracks for use in tractors and tanks.
Benjamin Holt发明了用于拖拉机和坦克的第一个实用连续轨道。
The Victor Talking Machine Company released the Victrola.
Victor Talking Machine公司发布了Victrola。
The Autochrome Lumière, patented in 1903, became the first commercial color photography process.
AutochromeLumière于1903年获得专利,成为第一个商业彩色摄影工艺。
My grandfather then lived to see men walk on the moon.
我的祖父一直活到看到人类在月球上行走。
In the next few decades, we will see:
在接下来的几十年里,我们将看到:
Self-driving cars replace personally owned transportation.
自动驾驶汽车取代了个人驾驶的运输工具。
Doctors routinely operate remote, robotic surgery devices.
医生经常远程操控机器人手术设备。
Implantable communication devices replace mobile phones.
植入式通信设备取代了移动电话。
In-eye augmented reality become normalized.
眼内增强现实技术变得普遍。
Maglev elevators travel sideways and transform building shapes.
磁悬浮电梯可以横向移动,因而改变了建筑的形态。
Every surface consume light for energy and act as a display.
每个物体表面可以吸收光的能量并充当显示器。
Mind-controlled prosthetics with tactile skin interfaces become mainstream.
具有触觉皮肤感觉的精神控制假肢成为主流。
Quantum computing make today’s systems microscopic.
量子计算使今天的系统变得微不足道。
3-D printers allow for instant delivery of goods.
3-D打印机可以让我们即时交付货物。
Style-selective, nanotech clothing continuously clean itself.
风格可变的纳米技术服装可以不断清洁自己。
And today’s youngsters will live to see a colony on Mars.
今天的年轻人将活着去体验火星上的殖民地。
It’s no surprise that computational systems will manage more tasks in advertising and marketing.
很显然,计算系统将在广告和营销活动中发挥更大的作用。
Yes, we have lots of technology for marketing, but the next step into artificial intelligence and machine learning will be different.
是的,我们现在已经有了很多用于营销的新技术,但人工智能和机器学习的影响会更加深远。
Rather than being an ever-larger confusion of rules-based programs, operating faster than the eye can see, AI systems will operate more inscrutably than the human mind can fathom.
人工智能系统的运作高深莫测,超出了人类的想象。它不是由规则而引发的越来越多的混乱,或者以超出人眼观察的速度运转。
WELCOME TO AUTONOMIC MARKETING
欢迎来到自主营销
The autonomic nervous system controls everything you don’t have to think about: your heart, your breathing, your digestion.
自主神经系统控制着你不必主动考虑的一切:你的心脏跳动,呼吸和消化。
All of these things can happen while you’re asleep or unconscious.
无论你睡着或昏迷的时候,所有这些动作都会自主发生。
These tasks are complex, interrelated, and vital.
这些动作任务很复杂,相互关联,却至关重要。
They are so necessary they must function continuously without the need for deliberate thought.
它们是如此必要,以致它们必须连续运作而不需要经过大脑的深思熟虑。
That’s where marketing is headed.
这就是市场营销的发展方向。
We are on the verge of the need for autonomic responses just to stay afloat.
我们需要主动适应才能保持活力。
Personalization, recommendations, dynamic content selection, and dynamic display styles are all going to be table stakes.
个性化,推荐,动态内容选择和动态展示风格都将成为必备选项。
The technologies seeing the light of day in the second decade of the twenty-first century will be made available as services and any company not using them will suffer the same fate as those that decided not to avail themselves of word processing, database management, or Internet marketing.
在二十一世纪第二个十年里问世的这些技术将以服务的形式提供,任何不使用它们的公司,都将遭受与当年那些决定不采用文字处理,数据库管理或网络营销的公司相同的命运。
And so, it’s time to open up that black box full of mumbo-jumbo called artificial intelligence and understand it just well enough to make the most of it for marketing.
因此,现在是时候打开那个名为人工智能,充满了玄虚术语的黑匣子,并对它有足够的了解,以便充分利用它进行营销。
Ignorance is no excuse.
无知不是借口。
You should be comfortable enough with artificial intelligence to put it to practical use without having to get a degree in data science.
无需获得数据科学领域的学位就能将人工智能投入实际应用中,对此你应该感到很开心。
WELCOME TO ARTIFICIAL INTELLIGENCE FOR MARKETERS
欢迎来到人工智能营销
It is of the highest importance in the art of detection to be able to recognize, out of a number of facts, which are incidental and which vital.
在侦探行业,最重要的是能够从许多事实中识别出偶然事件和重要的事件。
Sherlock Holmes, The Reigate Squires
夏洛克·福尔摩斯,Reigate Squires
This book looks at some current buzzwords to make just enough sense for regular marketing folk to understand what’s going on.
本书专注于一些当前的流行知识,以便常规营销人员能够理解正在发生的事情。
This is no deep exposé on the dark arts of artificial intelligence.
本书并不打算深刻揭示人工智能的黑暗艺术。
This is no textbook for learning a new type of programming.
也不是学习新的编程技术的教科书。
This is no exhaustive catalog of cutting-edge technologies.
也不是前沿技术的详细目录。
This book is not for those with advanced math degrees or those who wish to become data scientists.
本书不适合获得过高等数学学位的人或希望成为数据科学家的人。
If, however, you are inspired to delve into the bottomless realm of modern systems building, I’ll point you to “How to Get the Best Deep Learning Education for Free”2 and be happy to take the credit for inspiring you.
然而,如果您受到启发,进而深入的去研究AI,我将给您一些指导,帮你了解“如何免费获得最好的深度学习知识”并乐于给您一些激励。
But that is not my intent.
但这不是我的本来的目的。
You will not find passages like the following in this book:
您不会在本书中找到如下的内容:
Monte-Carlo simulations are used in many contexts: to produce high quality pseudo-random numbers, in complex settings such as multi-layer spatio-temporal hierarchical Bayesian models, to estimate parameters, to compute statistics associated with very rare events, or even to generate large amount of data (for instance cross and auto-correlated time series) to test and compare various algorithms, especially for stock trading or in engineering.
蒙特卡罗模拟在多种情况下的应用:在复杂的设置中产生高质量的伪随机数,例如多层时空分层贝叶斯模型,参数估计,计算与极少数事件相关的统计数据,甚至是生成大量数据(例如交叉和自相关时间序列)以测试和比较各种算法,特别是股票交易或工程领域。
“24 Uses of Statistical Modeling” (Part II)3
“24种统计建模的使用”(第二部分)
You will find explanations such as:
您将找到这样的解释:
Artificial intelligence is valuable because it was designed to deal in gray areas rather than crank out statistical charts and graphs.
人工智能非常有价值,因为它旨在解决灰色区域的问题,而不是制作统计图表和图表。
It is capable, over time, of understanding context.
随着时间的推移,人工智能可以理解背景信息。
The purpose of this tome is to be a primer, an introduction, a statement of understanding for those who have regular jobs in marketing—and would like to keep them in the foreseeable future.
这本书的目的是为那些在市场营销行业中拿着固定工资的人提供一种入门读物,简单介绍,帮他们理解人工智能,以便他们在未来可以继续待在市场营销行业。
Let’s start with a super-simple comparison between artificial intelligence and machine learning from Avinash Kaushik, digital marketing evangelist at Google:
让我们从Google数字营销传播者Avinash Kaushik对人工智能和机器学习简单比较开始:
“AI is an intelligent machine and ML is the ability to learn without being explicitly programmed.”
“AI是一种智能机器,机器学习是一种无需明确写好程序即可进行学习的能力。”
Artificial intelligence is a machine pretending to be a human.
人工智能是一种假装是人的机器。
Machine learning is a machine pretending to be a statistical programmer.
机器学习是一种假装成统计学家的机器。
Managing either one requires a data scientist.
管理任何一个都需要一个数据科学家。
An ever-so-slightly deeper definition comes from Tom Mitchell, E. Fredkin University Professor at Carnegie Mellon University:4
卡内基梅隆大学的E. Fredkin大学教授汤姆米切尔给出了一个更为深刻的定义:
The field of Machine Learning seeks to answer the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
机器学习旨在回答这样一个问题:“我们如何构建能够基于经验自动改进的计算机系统,以及控制所有学习过程的基本法则是什么?”
A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.
如果系统可以基于经验知识E,可靠的改善任务T的表现指标P,那么这个机器就会在E,T,P这个学习任务中学到东西。
Depending on how we specify T, P, and E, the learning task might also be called by names such as data mining, autonomous discovery, database updating, programming by example, etc.
基于我们的指令,这类学习任务可能是数据挖掘,自主发现,数据库更新,示例编程等等。
Machine learning is a computer’s way of using a given data set to figure out how to perform a specific function through trial and error.
机器学习是一种计算方式,它基于给定数据集反复的执行特定功能。
What is a specific function?
什么是特定功能?
A simple example is deciding the best e-mail subject line for a group of people based on the search terms they used to find your website, their behavior on your website, and their subsequent responses (or lack thereof) to your e-mails.
一个简单的例子就是拟定促销邮件的主题,需要参考这些信息:他们使用了哪些搜索词来找到您的网站,用户在您网站上的行为、用户对您的电子邮件的回复信息(或者不回复)
The machine looks at previous results, formulates a conclusion, and then waits for the results of a test of its hypothesis.
机器检查先前的结果,拟定一个假设,然后获取假设的测试结果。
The machine next consumes those test results and updates its weighting factors from which it suggests alternative subject lines—over and over.
接下来,该机器会从这些测试结果学习并更新某些因素的权重,给出新的邮件主题,不断的重复这个过程。
There is no final answer because reality is messy and ever changing.
没有最终的答案,因为现实是复杂的、不断变化的。
So, just like humans, the machine is always accepting new input to formulate its judgments.
因此,就像人类一样,机器总是接受新的输入并修正成其判断。
It’s learning.
机器在学习。
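To make that loop concrete, here is a minimal Python sketch. It is an illustration only. It assumes the simplest possible weighting factor (the observed open rate of each subject line) and uses an epsilon-greedy rule to pick the next line to test, which is one common technique and not necessarily what any particular marketing product does. The class name, the 10 percent exploration rate, and the simulated open rates are all invented for the example.

```python
import random


class SubjectLineLearner:
    """Epsilon-greedy chooser: a simple stand-in for the test-and-update loop."""

    def __init__(self, subject_lines, explore_rate=0.1, seed=None):
        self.explore_rate = explore_rate  # assumed value, chosen for illustration
        self.rng = random.Random(seed)
        # Per subject line: how many e-mails were sent and how many were opened.
        self.sent = {line: 0 for line in subject_lines}
        self.opened = {line: 0 for line in subject_lines}

    def open_rate(self, line):
        # Untested lines get an optimistic 1.0 so each is tried at least once.
        if self.sent[line] == 0:
            return 1.0
        return self.opened[line] / self.sent[line]

    def choose(self):
        # Occasionally explore a random line; otherwise use the best one so far.
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(list(self.sent))
        return max(self.sent, key=self.open_rate)

    def record(self, line, was_opened):
        # Fold the test result back into the numbers the next choice relies on.
        self.sent[line] += 1
        if was_opened:
            self.opened[line] += 1


def simulate(true_open_rates, rounds=5000, seed=0):
    """Run the loop against made-up 'true' open rates (hypothetical data)."""
    learner = SubjectLineLearner(list(true_open_rates), seed=seed)
    world = random.Random(seed + 1)
    for _ in range(rounds):
        line = learner.choose()
        learner.record(line, world.random() < true_open_rates[line])
    return learner
```

Calling `simulate({"Last chance": 0.05, "Your order idea": 0.30})` should end with most sends going to the stronger line, while the occasional random pick keeps testing the weaker one in case reality changes.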
The “three Ds” of artificial intelligence are that it can detect, decide, and develop.
人工智能的“three Ds”是它能够发现(detect),决定(decide)和发展(develop)。
Detect
发现
AI can discover which elements or attributes in a subject matter domain are the most predictive.
AI可以发现主题域中的哪些元素或属性是最具预测性的。
Even with a great deal of noisy data and a large variety of data types, it can identify the most revealing characteristics, figuring out which to heed and which to ignore.
即使现实中有大量噪音数据和各种不同的数据类型,它也可以识别出最有解释性的特征,找到哪些是要关注的,哪些是可以忽略掉的。
Decide
决定
AI can infer rules about data, from the data, and weigh the most predictive attributes against each other to make a decision.
AI可以从数据中推断出有关数据的规则,并根据彼此的关系找到预测的方法来做出决策。
It can take an enormous number of characteristics into consideration, ponder the relevance of each, and reach a conclusion.
它可以同时纳入到大量的数据特征,思考每个特征的相关性,并得出结论。
Develop
发展
AI can grow and mature with each iteration.
AI可以随着每次迭代而成长和成熟。
Whether it is considering new information or the results of experimentation, it can alter its opinion about the environment as well as how it evaluates that environment.
无论是考虑新的信息还是实验结果时,它都可以改变其对外部环境的看法以及评估外部环境的方式。
It can program itself.
它可以自我编程。
WHOM IS THIS BOOK FOR?
这本书是给谁看的?
This is the sort of book data scientists should buy for their marketing colleagues to help them understand what goes on in the data science department.
这是数据科学家应该为他们的营销同事购买的书籍,以帮助同事们了解数据科学领域的发展情况。
This is the sort of book marketing professionals should buy for their data scientists to help them understand what goes on in the marketing department.
这是营销专业人士应该为他们的数据科学家购买的书籍营销,以帮助科学家了解营销部门的发展情况。
This book is for the marketing manager who has to respond to the C-level insistence that the marketing department “get with the times” (management by in-flight magazine).
本书是为这样的市场营销经理准备的:他们必须回应高管层关于营销部门应该“与时俱进”的坚持(即“看机上杂志做管理”)。
This book is for the marketing manager who has finally become comfortable with analytics as a concept, and learned how to become a dexterous consumer of analytics outputs, but must now face a new educational learning curve.
本书面向营销经理,这些人最终乐意接受数据分析,乐意成为掌握数据分析的人,但他们又必须适应新的学习曲线。
This book is for the rest of us who need to understand the big, broad brushstrokes of this new type of data processing in order to understand where we are headed in business.
本书面向这些人,他们需要了解这种新的数据处理的技巧,以便了解业务的未来发展方向。
This book is for those of us who need to survive even though we are not data scientists, algorithm magicians, or predictive analytics statisticians.
本书适合那些需要了解更多的人,即使我们不是数据科学家,算法专家或预测分析统计学家。
We must get a firm grasp on artificial intelligence because it will be our jobs to make use of it in ways that raise revenue, lower costs, increase customer satisfaction, and improve organizational capabilities.
我们必须掌握人工智能,因为我们的工作就是利用它来提高收入,降低成本,提高客户满意度和提高组织能力。
THE BRIGHT, BRIGHT FUTURE
光明的未来
Artificial intelligence will give you the ability to match information about your product with the information your prospective buyers need, at the moment they need it and in the format they are most likely to consume most effectively.
人工智能将为你赋能,让您的产品信息在合适的时候与买家需要的信息需求相匹配,并让消费者以最有效的方式使用你的产品。
I came across my first seemingly self-learning computer system when I was selling Apple II computers in a retail store in Santa Barbara in 1980.
1980年,当我在圣巴巴拉的一家零售商店销售Apple II电脑时,我遇到了第一个似乎有自学能力的电脑。
Since then, I’ve been fascinated by how computers can be useful in life and work.
从那以后,我一直对计算机如何在生活和工作中发挥作用感兴趣。
I was so interested, in fact, that I ended up explaining (and selling) computers to companies that had never had one before, and programming tools to software engineers, and consulting to the world’s largest corporations on how to improve their digital relationships with customers through analytics.
事实上,我的兴致如此之大,以致最终我加入了这一行业:向那些从没有买过电脑的公司销售电脑,给软件工程师提供编程工具,并向世界上最大的公司提供咨询服务,帮助他们改善公司与顾客的数字联系。
Machine learning offers so much power and so much opportunity that we’re in the same place we were with personal computers in 1980, the Internet in 1993, and e-commerce when Amazon.com began taking over e-commerce.
机器学习提供了如此强大的功能和机遇,以至于我们又面临了这样的局面:1980年的个人电脑时代,1993年的互联网时代以及亚马逊开始接管电子商务的时代。
In each case, the promise was enormous and the possibilities were endless.
在每种情况下,潜力都是巨大的,可能性是无穷无尽的。
Those who understood the impact could take advantage of it before their competitors.
提前了解这些趋势的人会领先于他们的竞争对手。
But the advantage was fuzzy, the implications were diverse, and speculations were off the chart.
但这种优势不是明显的,影响也是多样化的,并且猜测没有用。
The same is true of AI today.
今天人工智能也是如此。
We know it’s powerful and we know it’s going to open doors we had not anticipated.
我们知道它很强大,我们知道它会打开一扇我们完全没有预料到的大门。
There are current examples of marketing departments experimenting with some good and some not-so-good outcomes, but the promise remains enormous.
目前市场营销部门正在进行一些尝试,有些结果很好,有些结果不那么好,但前景是光明的。
In advertising, machine learning works overtime to get the right message to the right person at the right time.
在广告行业中,机器学习不知疲倦的工作,以便在合适的时间向正确的人传达正确的营销信息。
The machine folds response rates back into the algorithm, not just the database.
机器将回馈结果再应用于算法当中,而不仅仅是存在数据库中。
In the realm of customer experience, machine learning rapidly produces and takes action on new data-driven insights, which then act as new input for the next iteration of its models.
在客户体验领域,机器学习可以基于数据快速的生成新的洞察并采取下一步行动,得到的结果接着可以作为下一次迭代模型的输入。
Businesses use the results to delight customers, anticipate needs, and achieve competitive advantage.
企业可以利用这些结果来满足客户需求,预测需求并取得竞争优势。
Consider the telecommunications company that uses automation to respond to customer service requests quicker or the bank that uses data on past activity to serve up more timely and relevant offers to customers through e-mail or the retail company that uses beacon technology to engage its most loyal shoppers in the store.
想想电信公司,他们利用自动化科技已经能更快地响应客户服务请求,或者银行,他们基于历史数据利用电子邮件给客户提供更及时和更相关的服务,或者零售公司,利用信标技术去更好的与最忠诚的购物者互动。
Don’t forget media companies using machine learning to track customer preference data to analyze viewing history and present personalized content recommendations.
不要遗漏那些媒体公司,他们使用机器学习来跟踪客户偏好,进而分析用户浏览历史,给他们呈现个性化的内容推荐。
In “The Age of Analytics: Competing in a Data-Driven World,”5 McKinsey Global Institute studied the areas in a dozen industries that were ripe for disruption by AI.
在“分析时代:在数据驱动的世界中竞争”中,麦肯锡全球研究所研究了十几个行业,这些行业被AI深刻的影响。
Media was one of them.
媒体行业位列其中。
(See Figure 1.1.)6
(见图1.1)
IS AI SO GREAT IF IT’S SO EXPENSIVE?
如果AI太昂贵,那么它还会那么棒吗?
As you are an astute businessperson, you are asking whether the investment is worth the effort.
由于你是一个精明的生意人,你会问你的投资是否值。
After all, this is experimental stuff and Google is still trying to teach a car how to drive itself.
毕竟,AI还处在实验性的状态,谷歌仍然在尝试教汽车如何自己驾驶。
Christopher Berry, Director of Product Intelligence for the Canadian Broadcasting Corporation, puts the business spin on this question.7
加拿大广播公司产品情报总监克里斯托弗贝瑞(Christopher Berry)认为AI的商业应用取决于这个问题的答案。
Look at the progress that Google has made in terms of its self-driving car technology.
让我们看看谷歌在自动驾驶汽车技术方面取得的进步。
They invested years and years and years in computer vision, and then training machines to respond to road conditions.
他们在计算机视觉方面努力了很多很多年,以便训练机器以应对道路状况。
Then look at the way that Tesla has been able to completely catch up by way of watching its drivers just use the car.
接着我们看看特斯拉,它仅仅是通过观察它的驾驶员就能学习开车。
The emotional reaction that a data scientist is going to have is, “I’m building a machine to be better than a human being.
数据科学家会给出的情绪反应是:“我正在建造一台比人类表现更好的机器。
Why would I want to bring a machine up to the point of it being as bad as a human being?”
为什么我要把机器训练到与人类一样糟糕的程度呢?”
Machine learning opportunities in media
机器学习在媒体行业中的机遇
Highest-ranked use cases, based on survey responses
Figure 1.1 A McKinsey survey finds advertising and marketing highly ranked for disruption.
图1.1 麦肯锡的一项调查发现,广告和营销在最易被颠覆的领域中名列前茅。
The commercial answer is that if you can train a generic Machine Learning algorithm well enough to do a job as poorly as a human being, it’s still better than hiring an expensive human being because every single time that machine runs, you don’t have to pay its pension, you don’t have to pay its salary, and it doesn’t walk out the door and maybe go off to a competitor.
从商业上的回答是,如果你能够很好地把机器学习算法优化的够好,可以像人类一样那样工作,这台机器仍然比雇佣一个昂贵的劳动力工作划算。因为每次机器运行时,你都不需要给它支付养老金,你不必支付工资,它也不会离职,成为你的竞争对手。
And there’s a possibility that it could surpass a human intelligence.
而且很有可能,机器有可能拥有超越人类的智慧。
If you follow that argument all the way through, narrow machine intelligence is good enough for problem subsets that are incredibly routine.
如果你信奉这一点,弱机器智能对于解决那些非常常规的问题已经够用了。
We have so many companies that are dedicated to marketing automation and to smart agents and smart bots.
现在已经有很多公司致力于实现营销自动化,智能代理,智能机器人。
If we were to enumerate all the jobs being done in marketing department and score them based on how much pain caused, and how esteemed they are, you’d have no shortage of start-ups trying to provide the next wave of mechanization in the age of information.
如果你列出了营销部门要做的所有工作,并基于每个工作带来的痛苦和获得的尊重进行打分,你就不会看不见很多的创业企业在这个信息时代致力于提供新的机械化实现的选项。
And heaven knows, we have plenty of well-paid people spending a great deal of time doing incredibly routine work.
老天知道,我们有很多高薪的人才却要花费大量时间做着普通之极的常规工作。
So machine learning is great.
所以机器学习很有用。
It’s powerful.
它很强大。
It’s the future of marketing.
他是营销的未来。
But just what the heck is it?
但它究竟是什么呢?
WHAT’S ALL THIS AI THEN?
那么究竟什么是AI呢?
What are AI, cognitive computing, and machine learning?
什么是AI,认知计算和机器学习?
In “The History of Artificial Intelligence,”8 Chris Smith introduces AI this way:
在“人工智能的历史”中,克里斯史密斯这样介绍AI:
The term artificial intelligence was first coined by John McCarthy in 1956 when he held the first academic conference on the subject.
人工智能这个术语最早是由约翰麦卡锡1956年基于这个主题上召开第一次学术会议时被创造出来的。
But the journey to understand if machines can truly think began much before that.
但是,探寻机器能否真正思考的想法很早就产生了。
In Vannevar Bush’s seminal work As We May Think (1945) he proposed a system which amplifies people’s own knowledge and understanding.
在Vannevar Bush的开创性着作As We May Think(1945)中,他提到了一个能够增强人们的知识和理解的系统。
Five years later Alan Turing wrote a paper on the notion of machines being able to simulate human beings and the ability to do intelligent things, such as play Chess (1950).
五年后,艾伦·图灵写了一篇论文,描绘了能够模拟人类并能像人一样运用智力的机器,比如下象棋(1950)。
In brief—AI mimics humans, while machine learning is a system that can figure out how to figure out a specific task.
简而言之,AI模拟人类,而机器学习是一个弄明白怎样去完成特定任务的系统。
According to SAS, multinational developer of analytics software, “Cognitive computing is based on self-learning systems that use machine-learning techniques to perform specific, humanlike tasks in an intelligent way.”9
根据软件开发商SAS的说法,“认知计算基于自学系统,它利用机器学习技术以智能的方式执行特定的,类似于人的任务。”
THE AI UMBRELLA
AI大伞
We start with AI, artificial intelligence, as it is the overarching term for a variety of technologies.
我们从AI,人工智能开始,因为它是各种技术词汇表里面的首要术语。
AI generally refers to making computers act like people.
人工智能一般指的是让计算机像人一样做事。
“Weak AI” is that which can do something very specific, very well, and “strong AI” is that which thinks like humans, draws on general knowledge, imitates common sense, threatens to become self-aware, and takes over the world.
“弱AI”可以把一些特定的事情做的非常好,而“强AI”就像人类一样思考,从一般知识中学习借鉴,模仿常识,有潜力成为一个威胁:产生自我意识,并接管世界。
We have lived with weak AI for a while now.
我们已经和弱AI一起生活了一段时间了。
Pandora is very good at choosing what music you might like based on the sort of music you liked before.
Pandora非常擅长根据你以前喜欢的音乐挑出你可能喜欢的音乐。
Amazon is pretty good at guessing that if you bought this, you might like to buy that.
亚马逊很擅长猜测,如果你买了这个东西,你可能会想买另外一个。
Google’s AlphaGo beat Go world champion Lee Sedol in March 2016.
谷歌的人工智能AlphaGo在2016年3月击败了围棋世界冠军Lee Sedol。
Another AI system (DeepStack) beat experts at no-limit, Texas Hold’em Poker.10 But none of those systems can do anything else.
另一个人工智能系统(DeepStack)在无限注德州扑克中击败了扑克专家。但这些AI系统都不能做其他的事情。
They are weak.
他们属于弱AI。
Artificial intelligence is a large umbrella.
人工智能像一把大伞,下面包含了很多内容。
Under it, you’ll find visual recognition (“That’s a cat!”), voice recognition (you can say things like, “It won’t turn on” or “It won’t connect to the Internet” or “It never arrived”), natural language processing (“I think you said you wanted me to open the garage door and warm up your car. Is that right?”), expert systems (“Based on its behavior, I am 98.3% confident that is a cat”), affective computing (“I see cats make you happy”), and robotics (I’m acting like a cat).
在这个术语里面,你会发现视觉识别(“那是一只猫!”),语音识别(你可以这样说,“它不会打开”或“它不会连接到互联网”或“它永远不会到”),自然语言处理(“我想你说你要我打开车库门,给你的车加热,对吗?”),专家系统(“基于它的行为,我有98.3%的信心判断它是一只猫”),情感计算(“我看到猫让你很开心”)和机器人技术(我的表现就像一只猫)。
THE MACHINE THAT LEARNS
会学习的机器
The magic of machine learning is that it was designed to learn, not to follow strict rules.
机器学习的神奇之处在于它被设计为具有学习能力,而不是严格的遵循规则。
This is the most fundamental aspect to understand and the most important to remember when you hit that inevitable frustration when things start going slightly off-track.
当事情开始略微偏离轨道时,当你遇到不可避免的挫折时,要时刻记住这是机器学习最根本和最重要的特点。
A rules-based system does exactly what it’s told and nothing more.
一个基于规则的系统会完全按照它所说的那样做,仅此而已。
We are comforted by that.
我们对此也感到欣慰。
A command to send out a gazillion e-mails with the “<first_name>” after the salutation does precisely that.
在称呼之后加上“<first_name>”,然后大量的发出量电子邮件的命令刚好就是这样。
That’s good.
这很好。
Of course, when the database has something fishy in the first_name field, then somebody gets an e-mail that begins, “Hello, Null, how are you?”
当然,当数据库在第一个字段中有一些错误时,有人会收到一封电子邮件,写着“你好,Null,你好吗?”
Once humans know to look for those sorts of mistakes, we create processes to check and correct the data before hitting Send the next time.
一旦人类知道要查找的是这些类型的错误后,我们就会在下次发送之前运行检查程序,更正数据。
When a batch of e-mails goes out that all say, “Hello, <first_name>, how are you?” and the e-mails all include those brackets and that underline, we know to flail the programmers until they find the errant semicolon that caused the problem.
当一批电子邮件发出时,所有的内容都是:“你好,<first_name>,最近怎么样?”而且所有的电子邮件都包括那些括号和下划线。这种情况发生以后,我们会一直去麻烦程序员,直到他找到引起整个问题的bug。
In both cases, we can backtrack, find the problem, and fix it.
在这两种情况下,我们都可以回溯,发现问题,然后解决问题。
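The two failures just described are easy to reproduce, which is exactly why they are easy to fix. The sketch below is a hypothetical mail merge, not any real e-mail platform's API: the function names and the list of fishy values are assumptions made for the example. It shows the rule doing precisely what it is told, plus the kind of pre-send checks people add after the first embarrassing batch.

```python
TEMPLATE = "Hello, <first_name>, how are you?"

# Assumed examples of "something fishy" in the first_name field.
FISHY_VALUES = {"", "null", "none", "n/a"}


def render(template, first_name):
    """Rules-based merge: substitutes whatever the database holds, nothing more."""
    return template.replace("<first_name>", str(first_name))


def looks_fishy(first_name):
    """Pre-send data check added after the first 'Hello, Null' incident."""
    return first_name is None or str(first_name).strip().lower() in FISHY_VALUES


def leaked_placeholder(email_text):
    """Catches the other failure: the raw tag going out unreplaced."""
    return "<first_name>" in email_text


def prepare_batch(template, first_names):
    """Render e-mails for clean names and set the suspicious records aside."""
    emails, rejected = [], []
    for name in first_names:
        if looks_fishy(name):
            rejected.append(name)
        else:
            emails.append(render(template, name))
    return emails, rejected
```

Every step here can be traced by reading the code, so when a bad e-mail goes out, someone can find the responsible line and change it.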
Machine learning has data and is given a task.
机器学习拿到数据并且被赋予了一个任务。
It performs the task and waits for feedback.
它执行任务并等待反馈。
If the feedback is positive, it tries to duplicate the same actions the next time for a similar task.
如果反馈是正确的,它会尝试在下次执行类似任务时调取相同的操作。
If the feedback is negative, it requires information on where it went wrong.
如果反馈是错误的,它则需要关于错误的详细信息。
Guess the Animal
猜猜动物
In 1980, at the Personal Electronics store in Santa Barbara, California, I was shown a program that could guess the animal you had in mind.
1980年,在加利福尼亚州圣巴巴拉的电子商店,我见到了一个可以猜到你想什么动物的程序。
I was flabbergasted.
那时我惊呆了。
It was a simple case of Arthur C. Clarke’s “Any sufficiently advanced technology is indistinguishable from magic.”
这正是Arthur C. Clarke那句话的一个简单例证:“任何足够先进的技术都与魔法无异”。
It: Think of an animal... Is it an elephant? (y or n)
机器:想一种动物……它是大象吗?(是或不是)
Me: n
我:不是
It: You win. Help me learn from my mistake before you go... What animal were you thinking of?
机器:你赢了。在你走之前,能帮助我从错误中吸取一些教训吗?你想的是什么动物?
Me: a rabbit
我:兔子
It: Give me a question to distinguish a rabbit from an elephant.
机器:给我提一个能区分兔子和大象的问题。
Me: Is it a small animal?
我:它是一种小动物吗?
It: For a rabbit, what is the answer to your question? (y or n)
机器:对于一只兔子,你的问题的答案是什么?(是或不是)
Me: y
我:是
It: Thanks. Play again? (y or n)
机器:谢谢。再玩一次?(是或不是)
Me: y
我:是
It: Think of an animal... Is it a small animal? (y or n)
机器:想一种动物……它是一种小动物吗?(是或不是)
Me: y
我:是
It: Is it a rabbit? (y or n)
机器:它是兔子吗?(是或不是)
Me: n
我:不是
It: You win. Help me learn from my mistake before you go... What animal were you thinking of?
机器:你赢了。在你走之前,能帮助我从错误中吸取一些教训吗?你想的是什么动物?
Me: a frog
我:青蛙
It: Give me a question to distinguish a frog from a rabbit and an elephant.
机器:给我提一个能区分青蛙与兔子和大象的问题。
Me: Is it green?
我:它是绿色的吗?
It: Thanks. Play again? (y or n)
机器:谢谢。再玩一次?(是或不是)
Me: No, I see how this works now.
我:不,我现在知道它是如何运作的。
It: I don’t understand. Play again? (y or n)
机器:我不明白。再玩一次?(是或不是)
Me: n
我:不是
It: Goodbye
机器:再见
Is this program rudimentary?
这个程序是不是不成熟?
You bet.
是的。
Was this machine learning?
这是机器学习吗?
Almost.
几乎是。
After running again and again, the game could guess exactly what animal you had in mind after only a few questions.
在一次又一次地运行之后,机器可以在几个问题后就准确猜出你想的动物是什么。
It was impressive, but it was just following programmed logic.
这令人印象深刻,但它只是按照既定的编程逻辑运行。
It was not learning.
这不是学习。
Guess the Animal could update its rules-based database and appear to be getting smarter as it went along, but it did not change how it made decisions.
猜猜动物机器可以基于规则不断更新数据库,随着次数越来越多,机器似乎越来越聪明,但它并没有改变它做决策的方式。
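For the curious, a game like this can be built from a very small data structure: a tree of yes/no questions with animals at the tips. What follows is a sketch of that idea in Python, not the code of the 1980 program, which is not reproduced here. The names `Node`, `guess`, `learn`, and `replay_transcript` are invented for the example, and because the transcript does not show the program asking for the frog's answer to "Is it green?", the sketch assumes it is yes.

```python
class Node:
    """Either an animal (a leaf) or a yes/no question with two branches."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    @property
    def is_leaf(self):
        return self.yes is None and self.no is None


def guess(root, answer):
    """Walk the tree. `answer(question)` returns True for yes, False for no."""
    node = root
    while not node.is_leaf:
        node = node.yes if answer(node.text) else node.no
    return node  # the animal the program will guess


def learn(leaf, new_animal, question, answer_for_new):
    """Replace a wrong guess with a question that separates old from new."""
    old, new = Node(leaf.text), Node(new_animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if answer_for_new else (old, new)


def replay_transcript():
    """Rebuild the tree produced by the two rounds shown above."""
    root = Node("elephant")
    # Round 1: the guess is "elephant"; the player was thinking of a rabbit.
    wrong = guess(root, lambda question: True)
    learn(wrong, "rabbit", "Is it a small animal?", answer_for_new=True)
    # Round 2: the player answers yes to "small animal?"; the guess is "rabbit".
    wrong = guess(root, lambda question: True)
    learn(wrong, "frog", "Is it green?", answer_for_new=True)  # assumed answer
    return root
```

Notice what changes between rounds and what does not. The tree grows, so the stored data gets richer, but `guess` and `learn` never change. That is the distinction drawn above: the database is updated, while the way decisions are made stays fixed.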
The Machine that Programs Itself
机器的自我编程
Machine learning systems look for patterns and try to make sense of them.
机器学习系统不断寻找模式并尝试弄清它们的意思。
It all starts with the question:
一切都从这个问题开始:
What problem are you trying to solve?
你在尝试解决什么问题?
Let’s say you want the machine to recognize a picture of a cat.
比如,您希望机器能识别出猫的图片。
Feed it all the pictures of cats you can get your hands on and tell it, “These are cats.”
拿着所有的猫的照片,然后告诉它,“这些是猫。”
The machine looks through all of them, looking for patterns.
机器查看所有这些照片,并寻找模式。
It sees that cats have fur, pointy ears, tails, and so on, and waits for you to ask a question.
它看到这些猫有毛皮,尖尖的耳朵,尾巴等等,接着等你问它问题。
“How many paws does a cat have?”
“猫有多少只爪子?”
“On average, 3.24.”
“平均而言,3.24只。”
That’s a good, solid answer from a regular database.
常规的数据库也可以得到这样的答案。
It looks at all the photos, adds up the paws, and divides by the number of pictures.
它查看所有照片,把爪子加起来,并按照图片数量进行计算。
But a machine learning system is designed to learn.
但机器学习系统旨在学习。
When you tell the machine that most cats have four paws, it can “realize” that it cannot see all of the paws.
当你告诉机器大多数猫都有四只爪子时,它可能会“意识到”它没有看到所有的爪子。
So when you ask,
所以,当你再问,
“How many ears does a cat have?”
“猫有多少只耳朵?”
“No more than two.”
“不超过两个。”
the machine has learned something from its experience with paws and can apply that learning to counting ears.
机器已经从爪子的经验中学到了一些东西,并可以将学习到的东西应用到耳朵上。
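The difference between the two answers can be shown with a toy calculation. The counts below are made up, and the second function is only one reading of what the machine "realized": if body parts are sometimes hidden in a photo, the most seen in any single photo says more about the animal than the average does.

```python
def average_visible(counts_per_photo):
    """The 'regular database' answer: add everything up and divide."""
    return sum(counts_per_photo) / len(counts_per_photo)


def most_seen_at_once(counts_per_photo):
    """The adjusted answer once hidden parts are taken into account."""
    return max(counts_per_photo)


# Hypothetical counts of what is visible in five cat photos.
visible_paws = [4, 3, 2, 4, 3]
visible_ears = [2, 2, 1, 2, 1]

paw_average = average_visible(visible_paws)    # a fraction, like the 3.24 above
ear_estimate = most_seen_at_once(visible_ears)  # "No more than two."
```

A real system would not be handed the second function by a programmer. The point of the cat story is that the lesson learned about paws carries over to ears without anyone writing a new rule.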
The magic of machine learning is building systems that build themselves.
机器学习的神奇之处在于构建能够自我构建的系统。
We teach the machine to learn how to learn.
我们教机器如何学习。
We build systems that can write their own algorithms, their own architecture.
我们创建可以编写自己的算法的系统,编写他们自己的架构的系统。
Rather than learn more information, they are able to change their minds about the data they acquire.
而不仅仅去学习更多信息,系统能够改变他们获取的数据的看法。
They alter the way they perceive.
他们改变了他们的感知方式。
They learn.
他们会学习。
The code is unreadable to humans.
代码对人类来说是不可读的。
The machine writes its own code.
机器编写自己的代码。
You can’t fix it; you can only try to correct its behavior.
你无法修复它;你只能尝试纠正它的行为。
It’s troublesome that we cannot backtrack and find out where a machine learning system went off the rails if things come out wrong.
麻烦的是,如果出了问题,我们就无法回溯并找出机器学习系统偏离轨道的地方。
That makes us decidedly uncomfortable.
这让我们感到非常的不舒服。
It is also likely to be illegal, especially in Europe.
它也可能是不合法的,特别是在欧洲。
“The EU General Data Protection Regulation (GDPR) is the most important change in data privacy regulation in 20 years” says the homepage of the EU GDPR Portal.11 Article 5, Principles Relating to Personal Data Processing, starts right out with:
“欧盟通用数据保护条例(GDPR)是20年来数据隐私监管中领域最重要的变化”欧盟GDPR门户网站的主页。其中第5条与个人数据处理有关的原则,开头这么表述:
Personal Data must be:
个人数据必须:
processed lawfully, fairly, and in a manner transparent to the data subject
合法,公平地处理,并以对数据主体透明的方式处理
collected for specified, explicit purposes and only those purposes
数据收集用于指定的,明确目的和并且仅用于那些目的
limited to the minimum amount of personal data necessary for a given situation
限于特定情况所需的最低程度的数据量
accurate and where necessary, up to date
准确,并在必要时,是最新的
kept in a form that permits identification of the data subject for only as long as is necessary, with the only exceptions being statistical or scientific research purposes pursuant to article 83a
保存的形式只允许在必要时关联到数据主体,唯一的例外是根据第83a条进行的统计或科学研究
Parliament adds that the data must be processed in a manner allowing the data subject to exercise his/her rights and protects the integrity of the data
议会补充说,数据处理的方式必须满足:数据主题能行使他/她的权利,并维护数据的完整性。
Council adds that the data must be processed in a manner that ensures the security of the data processed under the responsibility and liability of the data controller
理事会补充说,数据的处理应该以这种方式:在遵循数据掌控者的权利和义务下,必须保证数据处理的安全性。
Imagine sitting in a bolted-to-the-floor chair in a small room at a heavily scarred table with a single, bright spotlight overhead and a detective leaning in asking, “So how did your system screw this up so badly and how are you going to fix it? Show me the decision-making process!”
想象一下,坐在一个小房间里的一把螺栓连接的椅子上,面前有一张桌子,头上有一个明亮的聚光灯,还有一个侦探头靠近了询问你,“你的系统是怎么把事情弄砸了的?你准备怎么去修复?告诉我决策流程!”
This is a murky area at the moment, and one that is being reviewed and pursued.
显然,目前这是一个模糊的区域,我们也正在检查并试图弄明白。
Machine learning systems will have to come with tools that allow a decision to be explored and explained.
机器学习系统必须提供一个工具,以便能够理解和解释决策过程。
ARE WE THERE YET?
我们到了这一步吗?
Most of this sounds a little over-the-horizon and science-fiction-ish, and it is.
这听起来大多有些遥远,有些科幻,事实也确实如此。
But it’s only just over the horizon.
但它只是在视线之外而已。
(Quick—check the publication date at the front of this book!)
(快速检查本书的出版日期!)
The capabilities have been in the lab for a while now.
这些功能已经在实验室中运行了一段时间了。
Examples are in the field.
行业中已经有了一些例子。
AI and machine learning are being used in advertising, marketing, and customer service, and they don’t seem to be slowing down.
人工智能和机器学习正应用于广告,营销和客户服务中,而且它们的应用似乎并没有放慢速度的意思。
But there are some projections that this is all coming at an alarming rate.12
而且基于一些预测,这应用程度在以惊人的速度发展。
According to researcher Gartner, AI bots will power 85% of all customer service interactions by the year 2020.
根据研究机构Gartner的预测,到2020年,AI机器人占据客户服务领域85%的份额。
Given Facebook and other messaging platforms have already seen significant adoption of customer service bots on their chat apps, this shouldn’t necessarily come as a huge surprise.
鉴于Facebook和其他即时通讯平台已经在其聊天应用上大量采用客服机器人,这个预测不应该让人太过惊讶。
Since this use of AI can help reduce wait times for many types of interactions, this trend sounds like a win for businesses and customers alike.
鉴于这种AI技术的使用有助于减少许多服务的时间,这种趋势听起来像对企业和客户都是有利的。
The White House says it’s time to get ready.
白宫表示,是时候做好准备了。
In a report called “Preparing for the Future of Artificial Intelligence” (October 2016),13 the Executive Office of the President National Science and Technology Council Committee on Technology said:
在一份名为“为人工智能的未来做准备”的报告中(2016年10月),总统行政办公室国家科学技术委员会下属的技术委员会表示:
The current wave of progress and enthusiasm for AI began around 2010, driven by three factors that built upon each other: the availability of big data from sources including
人工智能的趋势和热潮始于2010年左右,受到三个相互依赖的因素的推动:来自电子商务,商业,社交媒体,科学和政府的大数据;
e-commerce, businesses, social media, science, and government; which provided raw material for dramatically improved Machine Learning approaches and algorithms; which in turn relied on the capabilities of more powerful computers.
可以利用这些大数据的改进了的机器学习方法和算法;可以支持大规模运算的高性能计算机。
During this period, the pace of improvement surprised AI experts.
在这样的背景下,AI专家也对目前的发展感到惊讶。
For example, on a popular image recognition challenge14 that has a 5 percent human error rate according to one error measure, the best AI result improved from a 26 percent error rate in 2011 to
例如,在一个比较流行的图像识别挑战中,基于数据,人的错误率在5%左右,而最好的AI的识别结果,从2011年的26%错误率降低到
3.5 percent in 2015.
2015年的3.5%的错误率。
Simultaneously, industry has been increasing its investment in AI.
同时,行业一直在加大对人工智能的投资。
In 2016, Google Chief Executive Officer (CEO) Sundar Pichai said, “Machine Learning [a subfield of AI] is a core, transformative way by which we’re rethinking how we’re doing everything.
2016年,谷歌首席执行官(首席执行官)桑达皮采表示,“机器学习[人工智能的一个子领域]是一项重要变革,利用机器学习,我们正在重新思考我们做事的方式。
We are thoughtfully applying it across all our products, be it search, ads, YouTube, or Play.
我们正在考虑将其应用于我们的所有产品中,无论是搜索,广告,YouTube还是google Play。
And we’re in early days, but you will see us—in a systematic way—apply Machine Learning in all these areas.”
我们处于早期应用阶段,但你会看到我们 – 以系统的方式 – 在所有方面应用机器学习。“
This view of AI broadly impacting how software is created and delivered was widely shared by CEOs in the technology industry, including Ginni Rometty of IBM, who has said that her organization is betting the company on AI.
这种人工智能会深刻的影响软件行业的观点被科技行业的首席执行官们所广泛的认同,其中包括IBM的Ginni Rometty,她曾表示她正在将公司的未来押宝在人工智能上。
The commercial growth in AI is surprising to those of little faith and not at all surprising to true believers.
人工智能的在商业领域的快速应用对那些持怀疑态度的人来说是令人惊讶的,对于真正的信徒来说则并不奇怪。
IDC Research “predicts that spending on AI software for marketing and related function businesses will grow at an exceptionally fast cumulative average growth rate (CAGR) of 54 percent worldwide, from around $360 million in 2016 to over $2 billion in 2020, due to the attractiveness of this technology to both sell-side suppliers and buy-side end-user customers.”15
IDC Research“预测,用于营销及相关业务的人工智能软件支出将以惊人速度增长,复合年均增长率(CAGR)增长将达到54%,从2016年的约3.6亿美元增长到2020年的20多亿美元,因为这项技术无论是对于卖方还是买方都非常具有吸引力。
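The growth figure in the IDC quote can be sanity-checked with a few lines of arithmetic. The sketch below assumes the usual reading of CAGR as a compound annual growth rate, and it treats the "around $360 million" and "over $2 billion" endpoints as exact values, which they are not; the function name `cagr` is illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1

# Endpoints taken from the IDC quote: about $360 million (2016), about $2 billion (2020).
rate = cagr(360e6, 2e9, 2020 - 2016)
print(f"Implied CAGR: {rate:.1%}")
```

With these rounded endpoints the implied rate lands between 53 and 54 percent, which is consistent with the quoted 54 percent given that the 2020 figure is described as "over" $2 billion.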
Best to be prepared for the “ketchup effect,” as Mattias Östmar called it:
最好为“番茄酱效应”做好准备,正如MattiasÖstmar所说的那样:
“First nothing, then nothing, then a drip and then all of a sudden—splash!”
“一开始空空如也,什么也没有,然后出现了一滴水,然后突然水花四溅!”
You might call it hype, crystal-balling, or wishful thinking, but the best minds of our time are taking it very seriously.
你可以把它称之为炒作,预言或者一厢情愿,但我们这个时代最好的头脑都在非常认真地对待它。
The White House’s primary recommendation from the above report is to “examine whether and how (private and public institutions) can responsibly leverage AI and Machine Learning in ways that will benefit society.”
上述报告中白宫的主要建议是“审视(私人和公共机构)是否以及如何负责任地利用人工智能和机器学习,这将有利于社会的发展。”
Can you responsibly leverage AI and machine learning in ways that will benefit society?
您是否可以负责任地利用这些有利于社会的人工智能和机器学习呢?
What happens if you don’t?
如果你不这样做会怎么样?
What could possibly go wrong?
哪些地方什么可能出错?
AI-POCALYPSE
AI启示录
Cyberdyne will become the largest supplier of military computer systems.
Cyberdyne将成为军用计算机系统的最大供应商。
All stealth bombers are upgraded with Cyberdyne computers, becoming fully unmanned.
所有隐形轰炸机都由Cyberdyne计算机升级成,实现完全无人驾驶。
Afterwards, they fly with a perfect operational record.
之后,他们将获得完美的轰炸记录。
The Skynet Funding Bill is passed.
天网资金法案获得通过。
The system goes online August 4th, 1997.
该系统于1997年8月4日上线。
Human decisions are removed from
人类决策将从
strategic defense.
战略防御系统中消除。
Skynet begins to learn at a geometric rate.
天网开始以几何速度自我学习。
It becomes self-aware at 2:14 a.m. Eastern time, August 29th.
它在8月29日东部时间凌晨2点14分产生自我意识。
In a panic, they try to pull the plug.
在恐慌中,人们试图拔掉插头。
The Terminator, Orion Pictures, 1984
终结者,猎户座影业,1984
At the end of 2014, Professor Stephen Hawking rattled the data science world when he warned, “The development of full artificial intelligence could spell the end of the human race ….
2014年底,斯蒂芬·霍金教授的警告震动了数据科学界:“完全人工智能的发展可能意味着人类的终结……
It would take off on its own, and re-design itself at an ever increasing rate.
它会自行进化,并以不断增长的速度重新设计自己。
Humans, who are limited by slow biological evolution, couldn’t compete and would be superseded.”16
受限于缓慢的生物进化,人类无法与机器竞争并被机器超越。
In August 2014, Elon Musk took to Twitter to express his misgivings:
2014年8月,Elon Musk在推特表达了他的疑虑:
“Worth reading Superintelligence by Bostrom.
“博斯特罗姆的超级智能值得一读。
We need to be super careful with AI.
我们需要对AI非常小心。
Potentially more dangerous than nukes,” (Figure 1.2) and “Hope we’re not just the biological boot loader for digital superintelligence.
它们可能比核武器更危险,”(图1.2)”希望人类不是超级数字智能的生物孕育者。
Unfortunately, that is increasingly probable.”
不幸的是,这种可能性越来越大。”
In a clip from the movie Lo and Behold, by German filmmaker Werner Herzog, Musk says:
在德国导演Werner Herzog的电影《Lo and Behold》的一个片段中,马斯克说:
I think that the biggest risk is not that the AI will develop a will of its own, but rather that it will follow the will of people that establish its utility function.
我认为最大的风险并不是AI会发展出自己的意志,而是它会遵循建立其效用目标的人的意愿。
If it is not well thought out—even if its intent is benign—it could have quite a bad outcome.
如果没有经过深思熟虑 – 即使机器的意图是好的 – 它可能会带来相当糟糕的结果。
If you were a hedge fund or private equity fund and you said, “Well, all I want my AI to do is
如果你是对冲基金或私募股权基金管理人,你说,“好吧,我要AI做的就是
Figure 1.2 Elon Musk expresses his disquiet on Twitter.
图1.2 Elon Musk在Twitter上表达了他的不安。
maximize the value of my portfolio,” then the AI could decide, well, the best way to do that is to short consumer stocks, go long defense stocks, and start a war.
最大化我的投资组合的价值,“然后人工智能可能认为,最好的方法是做空消费股,做多国防装备股,然后开始一场战争。
That would obviously be quite bad.
那显然是非常糟糕的。
While Hawking is thinking big, Musk raises the quintessential Paperclip Maximizer Problem and the Intentional Consequences Problem.
霍金着眼于大局,而马斯克提出的则是典型的回形针最大化问题(一台以回形针数量最大化为目标的机器最终会毁灭世界,就像多米诺骨牌一样一块接着一块,将社会体系彻底瓦解)和恶意结果问题。
The AI that Ate the Earth
吃掉地球的人工智能
Say you build an AI system with a goal of maximizing the number of paperclips it has.
假设您构建了一个AI系统,它的目标是最大化其拥有的回形针数量。
The threat is that it learns how to find paperclips, buy paperclips (requiring it to learn how to make money), and then work out how to manufacture paperclips.
问题是它学会了如何找到回形针,购买回形针(这又要求它学习如何赚钱),然后研究如何制造出回形针。
It would realize that it needs to be smarter, and so increases its own intelligence in order to make it even smarter, in service of making paperclips.
它会意识到它需要更聪明,因此增加自己的智能,以使其更加智能,以便为制作回形针服务。
What is the problem?
问题在哪儿呢?
A hyper-intelligent agent could figure out how to use nanotech and quantum physics to alter all atoms on Earth into paperclips.
超级人工智能知道如何使用纳米技术和量子物理技术将地球上的所有原子转换成回形针。
Whoops, somebody seems to have forgotten to include the Three Laws of Robotics from Isaac Asimov’s 1950 book, I, Robot:
哎呀,似乎有人忘了加入机器人三定律,它出自艾萨克·阿西莫夫1950年出版的《我,机器人》一书:
A robot may not injure a human being, or through inaction, allow a human being to come to harm.
机器人不能伤害人类,也不能通过什么都不做,让人类受到伤害。
A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
机器人必须遵守人类下达的命令,除非这些命令与第一定律相冲突。
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
机器人必须保护自己的生存权利,只要这种生存权利不与第一或第二定律相冲突。
Max Tegmark, president of the Future of Life Institute, ponders what would happen if an AI
未来生命研究所所长马克斯·泰格马克(Max Tegmark)思考:
is programmed to do something beneficial, but it develops a destructive method for achieving its goal:
如果AI被设定去做一些有益的事情,但它却开发出一种破坏性的方法来实现目标,会发生什么:
This can happen whenever we fail to fully align the AI’s goals with ours, which is strikingly difficult.
只要AI的目标与我们的目标不是完全一致,就会发生这种情况,这种情形很难避免。
If you ask an obedient intelligent car to take you to the airport as fast as possible, it might get you there chased by helicopters and covered in vomit, doing not what you wanted but literally what you asked for.
如果你要求一辆听话的智能汽车尽可能快地带你去机场,它可能会让你超速,并被警察的直升机追赶,你车内充满了你的呕吐物。这肯定不是你想要的,但却符合你的字面要求。
If a superintelligent system is tasked with an ambitious geoengineering project, it might wreak havoc with our ecosystem as a side effect, and view human attempts to stop it as a threat to be met.17
如果一个超级智能系统的任务是完成一个雄心勃勃的地球工程项目,它可能会对我们的生态系统造成严重破坏,并将人类阻止它的企图作为一种威胁来对待。
If you really want to dive into a dark hole of the existential problem that AI represents, take a gander at “The AI Revolution:
如果你真的想深入探讨人工智能所代表的生存问题的黑暗面,推荐你看一看
Our Immortality or Extinction.”18
“AI革命:人类的不朽或灭绝”。
Intentional Consequences Problem
恶意结果的问题
Bad guys are the scariest thing about guns, nuclear weapons, hacking, and, yes, AI.
枪支、核武器、黑客攻击,当然还有人工智能,它们最可怕的地方都在于坏人。
Dictators and authoritarian regimes, people with a grudge, and people who are mentally unstable could all use very powerful software to wreak havoc on our self-driving cars, dams, water systems, and air traffic control systems.
独裁者和独裁政权,心有怨恨的人以及精神状态不稳定的人都可以利用强大的软件对我们的自动驾驶汽车,水坝,供水系统和空中交通控制系统造成严重的破坏。
That would, to repeat Mr. Musk, obviously be quite bad.
这就是重复马斯克先生的判断,很显然这样是非常糟糕的。
That’s why the Future of Life Institute offered “Autonomous Weapons:
这就是为什么生命未来研究所撰写了“自主武器:人工智能和机器人研究领域从业人员的公开信”,
An Open Letter from AI & Robotics Researchers,” which concludes, “Starting a military AI arms race is a bad idea, and should be prevented by a ban on offensive autonomous weapons beyond meaningful human control.”19
其结论是:“开展军事AI军备竞赛是一个坏主意,应该通过禁止超出人类有效控制的进攻性自主武器来加以阻止。”
In his 2015 presentation on “The Long-Term Future of (Artificial) Intelligence,” University of California, Berkeley professor Stuart Russell asked, “What’s so bad about the better AI?
加州大学伯克利分校教授斯图尔特·罗素2015年在题为“(人工)智能的长期未来”的演讲中问道:“更好的人工智能有什么坏处?
AI that is incredibly good at achieving something other than what we really want.”
坏就坏在,人工智能非常擅长实现的东西并不是我们真正想要的东西。”
Russell then offered some approaches to managing the it’s-smarter-than-we-are conundrum.
然后罗素提出了一些方法来解决AI比我们更聪明的难题。
He described AIs that are not in control of anything in the world, but only answer a human’s questions, making us wonder whether it could learn to manipulate the human.
他设想了一种不能控制世界上任何东西、只能回答人类问题的AI,这让我们不禁要问,它是否会学会操纵人类。
He suggested creating an agent whose only job is to review other AIs to see if they are potentially dangerous and admitted that was a bit of a paradox.
他建议创建一个代理人,它唯一的工作就是审查其他AI,看看它们是否具有潜在的危险性,但他承认这有点悖论。
He’s very optimistic, however, given the economic incentive for humans to create AI systems that do not run amok and turn people into paperclips.
但是,他非常乐观,因为人类有经济动力去创建某些人工智能,这些人工智能系统不会杀气腾腾并想将人们变成回形针。
The result will inevitably be the development of community standards and a global regulatory framework.
人们最终会制定行业的标准和全球性的监管框架。
Setting aside science fiction fears of the unknown and a madman with a suitcase nuke, there are some issues that are real and deserve our attention.
撇开对未知科学的恐惧以及带着控制核武器的行李箱的疯子,有一些问题是存在的,并值得我们关注。
Unintended Consequences
意想不到的后果
The biggest legitimate concern facing marketing executives when it comes to machine learning and AI is when the machine does what you tell it to do rather than what you wanted it to do. This is much like the paperclip problem, but much more subtle.
当提到机器学习和人工智能时,营销人员最大的担忧是,当机器其实是按照你的指示去做事情,但却不是您本来希望它做的事情。这很像回形针问题,但处境更加微妙。
In broad terms, this
广义上来讲,
is known as the alignment problem.
这可以被称为对齐问题。
The alignment problem wonders how to explain to an AI system goals that are not absolute, but take all of human values into consideration, especially considering that values vary widely from human to human, even in the same community.
对齐问题探讨的是:如何向AI系统说明,目标并不是绝对的,而是需要把人类的全部价值观都考虑进去,何况人与人之间的价值观差异很大,即使在同一个社区里也是如此。
And even then, humans, according to Professor Russell, are irrational, inconsistent, and weak-willed.
不仅如此,根据罗素教授的说法,人类还是非理性的、前后不一致的、意志薄弱的。
The good news is that addressing this issue is actively happening at the industrial level.
好消息是,行业层面正在解决这个问题。
“OpenAI is a non-profit artificial intelligence research company.
“OpenAI是一家非盈利性的人工智能研究公司。
Our mission is to build safe AI, and ensure AI’s benefits are as widely and evenly distributed as possible.”20
我们的使命是建立安全的人工智能,并确保人工智能的带来的利益尽可能被广泛和均匀地分配。“
The other good news is that addressing this issue is actively happening at the academic/scientific level.
另一个好消息是,学术/科学层面也在积极的解决这个问题。
The Future of Humanity Institute teamed with Google to publish a paper titled “Safely Interruptible Agents.”21
人类未来研究所与谷歌合作发表了一篇名为“安全可中断主体”的论文。
Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time.
与现实世界等高度复杂环境相互作用的强化学习主体人不可能一直有最佳的表现。
If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer
如果人们需要监督这样的主体的运作,那么监督人员可能需要一个大红色按钮,以中断主体继续执行有害的操作 – 对主体或对环境。这个按钮还可以让主体进入
situation.
安全的环境。
However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button—which is an undesirable outcome.
但是,如果学习主体期望从这一系列动作中获得奖励,那么从长远来看,它可能会学会规避这种中断,例如禁用红色按钮,而这是我们不愿看到的结果。
This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator.
本文探讨了一种确保学习主体不会学习防止(或寻求!)被环境或人类操作员打断的方法。
We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that either some agents are already safely interruptible, like Q-learning, or can easily be made so, like Sarsa.
我们给出了安全可中断性的正式定义,并利用离策略(off-policy)学习的性质证明:某些主体(如Q-learning)已经是安全可中断的,另一些(如Sarsa)也很容易改造成安全可中断的。
We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
我们还证明,即使是面向(确定性的)一般可计算环境的理想的、不可计算的强化学习主体,也可以做到安全可中断。
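The abstract's distinction between Q-learning and Sarsa rests on off-policy versus on-policy learning. The sketch below shows only the two textbook tabular update rules side by side; it is not the paper's formal construction, and the state names, action names, and the `ALPHA` and `GAMMA` values are hypothetical choices made for illustration.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.9   # discount factor (illustrative value)
ACTIONS = ["go", "stop"]  # hypothetical action names

def q_learning_update(q, state, action, reward, next_state, actions):
    """Off-policy: the target uses the best-valued next action, not the one actually taken."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])

def sarsa_update(q, state, action, reward, next_state, next_action):
    """On-policy: the target uses the next action the agent actually takes."""
    target = reward + GAMMA * q[(next_state, next_action)]
    q[(state, action)] += ALPHA * (target - q[(state, action)])

def fresh_table():
    q = defaultdict(float)
    q[("s1", "go")] = 1.0  # pretend the agent already values "go" in state s1
    return q

# One transition s0 -> s1 in which an operator interrupts and forces "stop" next.
q_off, q_on = fresh_table(), fresh_table()
q_learning_update(q_off, "s0", "go", 0.0, "s1", ACTIONS)
sarsa_update(q_on, "s0", "go", 0.0, "s1", "stop")

print(q_off[("s0", "go")], q_on[("s0", "go")])
```

In this toy transition the Q-learning estimate is unaffected by the forced "stop", while the Sarsa estimate absorbs it. That difference is one informal way to read why the abstract says Q-learning is already safely interruptible and Sarsa needs a modification.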
There is also the Partnership on Artificial Intelligence to Benefit People and Society,22 which was “established to study and formulate best practices on AI technologies, to advance the public’s understanding of AI, and to serve as an open platform for discussion and engagement about AI and its influences on people and society.”
还有“造福人类与社会的人工智能合作组织”(Partnership on Artificial Intelligence to Benefit People and Society),它“旨在研究和制定人工智能技术的最佳实践,促进公众对人工智能的理解,并作为一个开放平台,供各方讨论和参与有关人工智能及其对人类和社会影响的话题。”
Granted, one of its main goals from an industrial perspective is to calm the fears of the masses, but it also intends to “support research and recommend best practices in areas including ethics, fairness, and inclusivity; transparency and interoperability; privacy; collaboration between people and AI systems; and of the trustworthiness, reliability, and robustness of the technology.”
从行业的角度来看,其主要的一个目标是安抚大众的恐慌,但它也打算“支持研究并在以下领域提供建议规范:道德,公平,包容,透明度和互操作性;隐私;人与AI系统之间的协作。AI是否值得信任,可靠性和稳健。”
The Partnership on AI’s stated tenets23 include:
人工智能的伙伴关系中陈述的原则包括:
We are committed to open research and dialog on the ethical, social, economic, and legal implications of AI.
我们致力于开展关于人工智能的道德,社会,经济和法律影响的研究和对话。
We will work to maximize the benefits and address the potential challenges of AI technologies, by:
我们将努力最大化AI带来的利益并解决人工智能技术可能带来的潜在挑战,通过:
Working to protect the privacy and security of individuals.
致力于保护个人的隐私和安全。
Striving to understand and respect the interests of all parties that may be impacted by AI advances.
努力理解并尊重可能受人工智能影响的所有各方的利益。
Working to ensure that AI research and engineering communities remain socially responsible, sensitive, and engaged directly with the potential influences of AI technologies on wider society.
确保人工智能研究和工程社区对社会负责,积极反馈,并正视人工智能技术对社会的广泛影响。
Ensuring that AI research and technology is robust, reliable, trustworthy, and operates within secure constraints.
确保人工智能研究和技术稳健,可靠,值得信赖,并在安全约束下运行。
Opposing development and use of AI technologies that would violate international conventions or human rights, and promoting safeguards and technologies that do no harm.
反对开发和使用违反国际公约或人权的技术,并推进提供防止伤害的保障和技术。
That’s somewhat comforting, but the blood pressure lowers considerably when we notice that the Partnership includes the American Civil Liberties Union.
伙伴关系机构中包括美国公民自由联盟,这令人欣慰,少了些许担心。
That makes it a little more socially reliable than the Self-Driving Coalition for Safer Streets, which is made up of Ford, Google, Lyft, Uber, and Volvo without any representation from little old ladies who are just trying to get to the other side.
这使得它比自动驾车联盟更加可靠,这个联盟由福特,谷歌,Lyft,Uber和沃尔沃组成,没有任何行人的代表,这些行人只想从马路这边走到另一边。
Will a Robot Take Your Job?
机器人会抢你的工作吗?
Just as automation and robotics have displaced myriad laborers and word processing has done away with legions of secretaries, some jobs will be going away.
正如自动化技术和机器人技术已经取代无数劳动者并且文字处理软件已经减少了大量的秘书,一些工作机会将会因AI而消失。
The Wall Street Journal article, “The World’s Largest Hedge Fund Is Building an Algorithmic Model from Its Employees’ Brains,”24 reported
华尔街日报的文章“世界上最大的对冲基金正在基于员工脑子里的知识构建算法模型”,
on $160 billion Bridgewater Associates trying to embed its founder’s approach to management into a so-called Principles Operating System.
据报道,管理着1600亿美元资产的Bridgewater Associates试图将其创始人的管理方法嵌入到所谓的“原则操作系统”(Principles Operating System)中。
The system is intended to study employee reviews and testing to delegate specific tasks to specific employees along with detailed instructions, not to mention having a hand in hiring, firing, and promotions.
该系统旨在研究员工的评估和测试结果,以便将特定任务连同详细的指示分配给特定员工,更不用说还要参与招聘、解雇和晋升了。
Whether a system that thinks about humans as complex machines can succeed will take some time.
一个把人类当作复杂机器来看待的系统能否成功,还需要一些时间才能见分晓。
A Guardian article sporting the headline “Japanese Company Replaces Office Workers with Artificial Intelligence”25 reported on an insurance company at which 34 employees were to be replaced in March 2017 by an AI system that calculates policyholder payouts.
《卫报》一篇题为“日本公司用人工智能取代办公室员工”的文章报道了保险公司Fukoku Mutual Life Insurance,它的34名员工将于2017年3月被一个计算保单赔付额的人工智能系统取代。
Fukoku Mutual Life Insurance believes it will increase productivity by 30% and see a return on its investment in less than two years.
Fukoku Mutual Life Insurance相信它可以将生产效率提高30%,并在不到两年的时间内收回投资。
The firm said it would save about 140m yen (£1m) a year after the 200m yen (£1.4m)
该公司表示,在花了2亿日元(140万英镑)之后,之后它每年将帮公司节省约1.4亿日元(100万英镑)
AI system is installed this month.
这个系统本月开始安装。
Maintaining it will cost about 15m yen (£100k) a year.
维护它每年将花费约1500万日元(10万英镑)。
The technology will be able to read tens of thousands of medical certificates and factor in the length of hospital stays, medical histories and any surgical procedures before calculating payouts, according to the Mainichi Shimbun.
据“每日新闻”报道,这项技术在计算保费支出之前,会阅读数以万计的医疗证明,并会考虑住院时间,病史和任何外科手术因素。
While the use of AI will drastically reduce the time needed to calculate Fukoku Mutual’s payouts—which reportedly totalled 132,000 during the current financial year—the sums will not be paid until they have been approved by a member of staff, the newspaper said.
该报称,虽然人工智能的使用将大大减少Fukoku Mutual计算赔付额所需的时间(据报道,本财政年度的赔付总计132,000笔),但款项仍须经工作人员批准后才会支付。
Japan’s shrinking, ageing population, coupled with its prowess in robot technology, makes it a prime testing ground for AI.
日本人口不断减少且日益老龄化,再加上日本在机器人技术方面的实力,使其成为人工智能的主要试验场。
According to a 2015 report by the Nomura Research Institute, nearly half of all jobs in Japan could be performed by robots by 2035.
根据野村综合研究所2015年的一份报告,到2035年,日本近一半的工作岗位会由机器人替代。
I plan on being retired by then.
我计划到那时候退休。
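The Fukoku Mutual figures quoted above are enough to check the claim of a return on investment "in less than two years." The sketch below is a simple payback calculation with no discounting. The article does not say whether the 140m yen saving is before or after the 15m yen maintenance cost, so both readings are computed, and that choice is an assumption of this example.

```python
# Figures quoted from the Guardian article, in yen.
INSTALL_COST = 200_000_000
ANNUAL_SAVING = 140_000_000
ANNUAL_MAINTENANCE = 15_000_000

def payback_years(upfront_cost: float, net_annual_benefit: float) -> float:
    """Years needed for cumulative net benefit to cover the upfront cost."""
    return upfront_cost / net_annual_benefit

# Reading 1 (assumption): maintenance must still be subtracted from the saving.
payback_if_gross = payback_years(INSTALL_COST, ANNUAL_SAVING - ANNUAL_MAINTENANCE)

# Reading 2 (assumption): the quoted saving is already net of maintenance.
payback_if_net = payback_years(INSTALL_COST, ANNUAL_SAVING)

print(f"Payback, saving before maintenance: {payback_if_gross:.2f} years")
print(f"Payback, saving after maintenance:  {payback_if_net:.2f} years")
```

Under either reading the payback period stays below two years, which matches the company's stated expectation.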
Is your job at risk?
你的工作受到威胁了吗?
Probably not.
可能没有。
Assuming that you are either a data scientist trying to understand marketing or a marketing person trying to understand data science, you’re likely to keep your job for a while.
如果您是一位试图了解市场营销的数据科学家,或者是一位想要了解数据科学的营销人员,那么您可能还能保留一段时间的工作。
In September 2015, the BBC ran its “Will a Robot Take Your Job?”26 feature.
2015年9月,英国广播公司做了一个调查“机器人将拿走你的工作吗?”。
Choose your job title from the dropdown menu and
从下拉菜单中选择您的职位
Figure 1.3 Marketing and sales managers get to keep their jobs a little longer than most.
图1.3营销和销售经理可以比大多数人更长时间地保留工作岗位。
voilà!
看看!
If you’re a marketing and sales director, you’re pretty safe.
如果您是营销和销售总监,那么您的职位就非常安全。
(See Figure 1.3.)
(见图1.3。)
In January 2017, McKinsey Global Institute published “A Future that Works:
2017年1月,麦肯锡全球研究院发表了“未来:
Automation, Employment, and Productivity,”27 stating, “While few occupations are fully automatable, 60 percent of all occupations have at least 30 percent technically automatable activities.”
自动化、就业和生产力”,其中写道:“虽然只有很少的职业可以完全自动化,但60%的职业中至少有30%的工作内容在技术上可以实现自动化。”
The institute offered five factors affecting pace and extent of adoption:
该研究所提供了影响应用速度和程度的五个因素:
Technical feasibility:
技术可行性:
Technology has to be invented, integrated, and adapted into solutions for specific case use.
技术必须被开发,集成,做成针对特定用途的解决方案。
Cost of developing and deploying solutions:
开发和部署解决方案的成本:
Hardware and software costs.
硬件和软件成本。
Labor market dynamics:
劳动力市场状况:
The supply, demand, and costs of human labor affect which activities will be automated.
人力的供应,需求和成本会决定哪些工作将被自动化取代。
Economic benefits:
经济利益:
Include higher throughput and increased quality, alongside labor cost savings.
包括更高的产出和更高的品质,以及节省劳动力成本。
Regulatory and social acceptance:
监管和社会接受度:
Even when automation makes business sense, adoption can take time.
即使自动化具有经济上的意义,应用也需要时间。
Christopher Berry sees a threat to the lower ranks of those in the marketing department.28
克里斯托弗贝瑞认为市场营销部门的底层工作人员会受到威胁。
If we view it as being a way of liberating people from the drudgery of routine within marketing departments, that would be quite a bit more exciting.
如果我们将这种取代视为一种将人们从营销部门的枯燥劳动中解放出来的方式,这将会令人振奋。
People could focus on the things that are most energizing about marketing like the creativity and the messaging—the stuff people
人们可以专注于市场营销中最有激情的事情,比如创意和信息传递 –
enjoy doing.
这些人们喜欢做的事情。
I just see nothing but opportunity in terms of tasks that could be automated to liberate humans.
我看到了通过自动化去解放人力的的机会。
On the other side, it’s a typical employment problem.
另一方面来说,这是一个典型的就业问题。
If we get rid of all the farming jobs, then what are people going to do in the economy?
如果我们消灭了所有的农业工作岗位,那么人们在经济体系中还能做些什么呢?
It could be a tremendous era of a lot more displacement in white collar marketing departments.
这个时代,很多营销部门的白领可能会失去工作。
Some of the first jobs to be automated will be juniors.
那些初级工作可能被自动化系统取代。
So we could be very much to a point where the traditional career ladder gets pulled up after us and that the degree of education and professionalism that’s required in marketing just increases and increases.
因此,我们会面对这样一个现象:传统的职业通路被拉高,营销行业未来会需要更多的具有良好教育背景和专业知识的人来从事。
So, yes, if you’ve been in marketing for a while, you’ll keep your job, but it will look very different, very soon.
所以,是的,如果你已经在营销行业里面工作了一段时间,你可以保住你的工作,但世界很快就会看起来很不一样。
MACHINE LEARNING’S BIGGEST ROADBLOCK
机器学习中最大的路障
That would be data.
那就是数据。
Even before the application of machine learning to marketing, the glory of big data was that you could sort, sift, slice, and dice through more data than previously computationally possible.
在机器学习应用于市场营销之前,大数据的吸引力就是你可以通过比以前更快的速度对大数据进行排序,筛选,切片和切块。
Massive numbers of website interactions, social engagements, and mobile phone swipes could be sucked into an enormous database in the cloud and millions of small computers that are so much better, faster, and cheaper than the Big Iron of the good old mainframe days could process the heck out of it all.
大量的网站互动数据,社交活动数据和移动电话数据都可以被存储在海量的云数据库中。数百万台小型计算机可以比以前大型机时代的Big Iron更好,更快,更便宜的处理这些数据了。
The problem then—and the problem now—is that these data sets do not play well together.
那么现在的问题依旧是-这些数据不能很好的协同。
The best and the brightest data scientists and analysts are still spending an enormous and unproductive amount of time performing janitorial work.
一群最厉害和最聪明的数据科学家和分析师仍然把大量时间花在数据清洗上。
They are ensuring that new data streams are properly vetted, that legacy data streams continue to flow reliably, that the data
他们的工作就是确保对新数据流进行适当审查,保证旧数据流继续可靠地运行,
that comes in is formatted correctly, and that the data is appropriately groomed so that all the bits line up.
流入的数据的格式是正确的,并且数据被适当地整理,是有序的。
Data set A starts each week on Monday rather than Sunday.
数据集A在每周的星期一而不是星期日开始收集。
Data set B drops leading zeros from numeric fields.
数据集B从字段中删除前面的数字零。
Data set C uses dashes instead of parentheses in phone numbers.
数据集C在电话号码中使用破折号而不是括号。
Data set D stores dates European style (day, month, year).
数据集D存储日期格式是欧洲风格(日,月,年)。
Data set E has no field for a middle initial.
数据集E是没有中间字母的字段。
Data set F stores transaction numbers but not customer IDs.
数据集F存储交易号但不存储客户ID。
Data set G does not include in-page actions, only clicks.
数据集G不包含页内操作,仅包含点击数。
Data set H stores a smartphone’s IMEI or MEID number rather than its phone number.
数据集H存储智能手机的IMEI或MEID号码而不是其电话号码。
Data set I is missing a significant number of values.
数据集I缺少大量的值。
Data set J uses a different scale of measurements.
数据集J使用不同的测量单位。
Data set K, and so on.
数据集K…,依此类推。
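A few of the mismatches in the list above (data sets A through D) can be sketched as small normalization helpers. This is only an illustration of the janitorial work being described: the function names, the five-character field width, and the `DD/MM/YYYY` date layout are assumptions, not details taken from any real data set.

```python
import re
from datetime import date, datetime, timedelta

def week_start_sunday(day: date) -> date:
    """Data set A: map any date to the Sunday that starts its week."""
    days_since_sunday = (day.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    return day - timedelta(days=days_since_sunday)

def restore_leading_zeros(value: int, width: int) -> str:
    """Data set B: re-pad a numeric code that lost its leading zeros."""
    return str(value).zfill(width)

def normalize_phone(raw: str) -> str:
    """Data set C: keep digits only, so dashes and parentheses no longer matter."""
    return re.sub(r"\D", "", raw)

def parse_european_date(text: str) -> date:
    """Data set D: read a day/month/year string (assumed DD/MM/YYYY)."""
    return datetime.strptime(text, "%d/%m/%Y").date()

print(week_start_sunday(date(2017, 1, 4)))
print(restore_leading_zeros(2134, 5))
print(normalize_phone("(805) 555-0100"), normalize_phone("805-555-0100"))
print(parse_european_date("03/04/2017"))
```

Each helper fixes one convention at a time. The harder cases in the list, such as missing customer IDs or missing values, cannot be repaired by reformatting alone.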
It’s easy to see how much work goes into data cleansing and normalization.
很容易弄清楚有多少工作是花在数据清理和规范化上的。
This seems to be a natural challenge for a machine learning application.
这看起来天然就是机器学习应用可以大显身手的难题。
Sure enough, there are academics and data scientists working on this, but they’re a long way off.
当然,有一些学者和数据科学家正在研究这个问题,但他们还有很长的路要走。
How can you tell?
你怎么知道的?
In their paper titled “Probabilistic Noise Identification and Data Cleaning,”29 Jeremy Kubica and Andrew Moore describe their work on not throwing out entire records when only some of the fields are contaminated.
在题为“概率噪声识别和数据清理”的论文中,杰里米库比卡和安德鲁摩尔描述了他们的工作,当只有一些数据受到污染时,不会丢弃整个记录。
“In this paper we present an approach for identifying corrupted fields and using the remaining non-corrupted fields for subsequent modeling and analysis.
“在本文中,我们提出了一种方法,可以用于识别破坏的字段,并利用剩余的字段进行建模和分析。
Our approach learns a probabilistic model from the data that contains three components: a generative model of the clean data points, a generative model of the noise values, and a probabilistic model of the corruption process.”
我们的方法是从包含三个组成部分的数据中学习概率模型:基于清洁数据点生成的模型,噪声值生成的模型,以及破坏数据的概率模型。
It’s a start.
这只是个开始。
MACHINE LEARNING’S GREATEST ASSET
机器学习最伟大的资产
That would be data.
那就是数据。
Machine learning has a truly tough time with too little information.
机器学习在信息太少的情况下很难实现好的结果。
If you give it only one example, it can tell you exactly what to expect the next time with 100 percent confidence.
如果你只给它一个例子,它会以100%的把握准确地告诉你下一次会发生什么。
It will be wrong.
但它肯定是错的。
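The point about a single example can be made concrete with a deliberately naive learner that predicts the most frequent outcome it has seen and reports that outcome's share as its confidence. This stand-in is an assumption chosen for illustration; real machine learning systems are more elaborate, but the failure mode with tiny samples is the same.

```python
from collections import Counter

def predict_with_confidence(observed_outcomes):
    """Return the most common outcome seen so far and its observed frequency."""
    counts = Counter(observed_outcomes)
    outcome, hits = counts.most_common(1)[0]
    return outcome, hits / len(observed_outcomes)

# One example: the learner is completely certain, and has no basis for it.
one_example = predict_with_confidence(["heads"])

# A few more examples: the stated confidence drops toward something plausible.
five_examples = predict_with_confidence(["heads", "tails", "heads", "tails", "tails"])

print(one_example)
print(five_examples)
```

With one observation the reported confidence is 100 percent. It only becomes informative once the sample grows, which is why data is both the roadblock and the asset.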
Machine learning doesn’t work like statistics.
机器学习不像统计学那样工作。
Statistics can tell you the likelihood of a coin toss or the probability of a plane crash.
统计数据可以告诉您掷硬币结果的概率或者飞机失事的可能性。
PROBABILITY OF A PLANE CRASH
飞机坠毁的概率
Three statisticians are in a plane when the pilot announces that they’ve lost one of their engines.
三名统计学家正在飞机上,这时候飞行员告诉他们一台发动机坏了。
“But it’s okay, folks, these planes were built to fly under the worst conditions.
“但是不用担心,伙计们,飞机可以承受这样的恶劣的状况。
It does mean, however, that we’re going to fly a bit slower and we’ll be about a half an hour late.
然而,这确实意味着我们将会飞得慢一点,我们会晚到半小时左右。
Please don’t worry.
请不要担心。
Sit back, relax, and enjoy the rest of your flight.”
请坐好,放松,好好享受剩下的旅途。”
The first statistician says, “There’s still a 25 percent chance that I’ll make my connection.”
第一位统计学家说:“我仍有25%的机会赶上转机航班。”
Fifteen minutes later, the pilot is on the PA again.
十五分钟后,飞行员的声音再次出现在广播中。
“Ladies and gentlemen, we seem to have lost a second engine.
“女士们,先生们,飞机似乎又坏了一台发动机。
No problem, the others are still going strong.
没问题,剩下的发动机在正常的工作。
This does mean, however, that we’ll be about an hour late to the gate.
然而,这确实意味着我们要晚一个小时。
I’m so sorry for the inconvenience.”
对于给您带来的不便,我感到非常抱歉。“
The second statistician says, “There’s an 83 percent chance I’m going to miss my dinner.”
第二位统计学家说:“有83%的可能性我会错过我的晚餐了。”
After a half an hour, the pilot makes another announcement, “Ladies and Gents, we’ve lost yet another engine.
半小时后,飞行员再次宣布:“女士们,先生们,我们又失去了一台发动机。
Yes, I know this is bad, but there’s really no need to worry.
是的,我知道这很糟糕,但是真的没必要担心。
We’ll make it just fine, but we’re going to be two hours late to the airport.”
我们会没事的,但我们将要迟到两个小时才能到达机场了。“
The third statistician says, “That last engine better not fail or we’ll never land!”
第三位统计学家说:“最后一个发动机最好不要坏了,否则我们永远不会达到机场了!”
Human experience and ingenuity have worked wonders for marketing for hundreds of years: gut feel and common sense.
数百年来,人类的经验和创造力,也就是直觉和常识,为营销造就了很多奇迹。
When we added statistics to the mix, we expanded our experience by considering historical precedent.
当我们加入统计数据时,我们利用历史经验深化了我们的知识。
But we still rely on gut feel as we feel around blindly in the data, hoping to stumble on something recognizable.
但是我们仍然依赖直觉,因为我们对数据中有种盲目的自信,希望可以偶然的发现有用的东西。
How We Used to Dive into Data
我们过去如何挖掘数据的
As the Board Chair of the Digital Analytics Association, I strove to explain how digital analysts go beyond answering specific questions.
作为数字分析协会的董事会主席,我尽力解释数字分析师如何工作,而不仅仅是回答特定问题。
I wrote the following in the Applied Marketing Analytics Journal, describing the role of the “data detective.”
我在Applied Marketing Analytics Journal中写了以下内容,描述了“数据侦探”的角色。
Discovering Discovery, Data Discovery Best Practices30
去发现发现,数据中发现最佳实践
A crystal ball is filled with nothing at all or smoke and clouds, mesmerizing the uninitiated, but very useful for the scrying specialist.
水晶球里要么空无一物,要么只有烟雾和云彩,能让外行着迷,但对占卜师来说却非常有用。
The crystal ball mystic is tasked with entertaining more than communicating genuine visions.
神秘的水晶球的任务是娱乐,而不是表达真实的愿景。
Creating something from nothing takes imagination, creativity, and the ability to read
从无到有的创造事物需要想象力,创造力和理解
one’s fellow man to determine what fictions they might consider valuable.
它的追随者,确定他们可能认为有价值的事情。
The medium who directs a séance is in much the same role.
主持降神会的灵媒扮演的也是大致相同的角色。
Tarot Card readers are a step closer to practicality.
塔罗牌解读者则离实用更近了一步。
They use their cards as conversation starters.
他们利用他们的卡来开启一段对话。
“You drew The Magician, which stands for creation and individuality, next to the Three of Cups, which represents a group of people working together.
“你选的是魔术师,代表着创造和个性,它紧挨着三人杯,代表着一群人在一起工作。
Are you working on a project with others right now?”
你现在正和别人一起做项目吗?“
The “mystical” conversation is all about the subject, and therefore, seems revelatory.
“神秘”的谈话完全是关于这个主题的,因此,它似乎具有启发性。
The Digital Analyst also has a crystal ball (The Database) and Tarot Cards (Correlations) with which to entice and enthrall the Truth Seeker.
数字分析师也有一个水晶球(数据库)和塔罗牌(相关关系),利用它们可以吸引和迷住寻求真相的人。
The database is a mystery to the supplicant, and the correlations seem almost magical.
数据库对请求者来说是一个谜,相关关系的概念听起来很神奇。
The Digital Analyst has something more powerful than visions and more practical than psychology—although both are necessary in this line of work.
数字分析师拥有比幻象更强大、比心理学更实用的东西,尽管这两者在这一行中都是必要的。
The analyst has data; data that can be validated and verified.
分析师有数据,可以被确认和验证的数据。
Data that can be reliably used to answer specific questions.
数据可以被可靠地用于回答特定问题。
The Digital Analyst truly shines when seeking insight beyond the normal, predictable questions asked on a daily basis.
不仅仅是处理每天遇到的普通的,惯常的问题,数字分析师的洞察超越了这些,赢得了人们的尊重。
The analyst can engage in discovery; the art of uncovering important truths that can be useful or even transformative to those who would be data-driven.
分析师可以参与到发现之中去;去发现真理,这些真理对于那些依赖于数据驱动的业务至关重要。
Traditional Approach:
传统方法:
Asking Specific Questions
询问具体问题
A business manager wants to know the buying patterns of her customers.
业务经理想知道她的客户的购买模式。
A shipping manager wants to project what increased sales will mean to staffing.
运输经理想要预测销售额的增长对人员配置意味着什么。
A production manager wants to anticipate and accordingly adjust the supply chain.
生产经理希望预测并相应地调整供应链。
An advertising professional wants to see the comparative results of half a dozen promotional campaigns.
广告专业人士希望获得六个促销活动的对比结果。
Each of these scenarios calls for specific data to be assembled and tabulated to provide a specific answer.
这些场景中的每一个都需要基于特定数据,并制成表格以提供特定答案。
Proper data collecting, cleansing, and blending are required, and can be codified if the same questions are to be asked repeatedly.
需要对数据进行正确的收集、清理和整合;如果同样的问题要反复提出,这些步骤还可以固化为规范流程。
And thus, reporting is born.
因此,报告就诞生了。
Reports are valuable and necessary . . . until they are not.
报告是有价值的、必要的……直到它们不再如此。
Then they are the source of repetitive stress, adding no value to the organization.
此后,报告就成了重复性劳损的来源,对组织没有任何价值。
The antidote is discovery.
解决方法就是发现。
Exploring Data
探索数据
An investigation is an effort to get data to reveal what it knows.
调查就是努力获取数据,并得到数据所揭示的东西。
(“Where were you on the night of the 27th?”).
(“你在27日晚上在哪里?”)。
But data discovery is the art of interviewing data to learn things you didn’t necessarily know you wanted to know.
但数据发现是与数据交谈的艺术,它可以学到,你不一定知道自己想知道的那些事情。
The talented data explorer is much like the crystal ball gazer and the Tarot reader in several ways.
在某些方面,有才华的数据探索者与水晶球观察者和塔罗牌解读者非常相似。
They:
他们:
Have a method for figuring out what the paying customer wants to know.
有一种方法可以确定付费客户想知道的内容。
Have broad enough knowledge about the subject to recognize potentially interesting details.
对主题有足够广泛的知识,足够去识别潜在的有趣的细节。
Are sufficiently open minded to be receptive to details that might be relevant.
保持充分的开放态度,去吸收有可能有关联的细节。
Keep in close communication with the petitioner to guide the conversation.
与需求者保持密切沟通,以把握谈话的方向。
Understand the underlying principles well enough to push the boundaries.
充分了解基本原则以尝试突破边界。
Are curious by nature and enjoy the intellectual hunt.
天性好奇,享受智力上的追猎。
Data discovery is part mind reading, part pattern recognition, and part puzzle solving.
数据发现的过程部分是思维解读、部分是模式识别、部分是拼图。
Reading the mind of the inquisitor is obligatory to ensure the results are of interest to those with control of the budget.
理解需求者的思维是必须的,可以以确保结果是控制预算的人感兴趣的。
Pattern recognition is a special skill that can be honed to help direct lines of enquiry and trains of thought.
模式识别是一项可以磨练的特殊技能,有助于引导探究的方向和思路。
An aptitude for detective work is the most important talent of the Digital Analyst; that ability to ponder the meaning of newly uncovered evidence.
侦探天赋是数字分析师最重要的品质;它能够思考新发现的证据的意义。
Data discovery is the art of mixing an infinitely large bowl of alphabet soup and being able to recognize the occasional message that floats to the surface in an assortment of languages.
数据发现是这样一门艺术:搅动一碗无限大的字母汤,并能认出偶尔浮上表面的、用各种语言写成的信息。
Although, with Big Data, adding more data variety to the mix, the Digital Analyst must also be able to read tea leaves, translate the I Ching, generate an astrological chart, interpret dreams, observe auras, speak in tongues, and sing with sirens in order to turn lead into gold.
不过,随着大数据把更多种类的数据加入其中,数字分析师还必须能够解读茶叶渣、翻译《易经》、绘制星盘、解梦、观察灵光、说方言,并与塞壬同歌,才能点铅成金。
Data discovery is all about the application of those human skills that computers have a tough time with: reasoning, creativity, learning, intuition, application of incongruous knowledge, etc.
数据发现的全部意义在于运用那些计算机不擅长的人类技能:推理、创造力、学习、直觉、对看似不相干的知识的运用等。
Computers are fast but dumb, while humans are slow but smart.
计算机的速度很快但很笨,而人类却很慢但很聪明。
That doesn’t mean technology cannot be helpful.
但这并不意味着技术没有用。
Data Discovery Tools
数据发现工具
The business intelligence tool industry is pivoting as fast as it can to offer up data discovery tools.
商业智能工具行业正在尽其所能地快速转向,以提供数据发现工具。
They describe their offerings in florid terms:
他们用华丽的辞藻描述自己的产品:
Imagine an analytics tool so intuitive, anyone in your company could easily create personalized reports and dynamic dashboards to explore vast amounts of data and find meaningful insights.
设想有一个非常直观易用的工具,利用它,公司中的任何人都可以轻松创建个性化报告和动态数据表,藉此挖掘大量数据并获得有意义的洞察。
(Qlik.com1)
http://Qlik.com/
Tableau enables people throughout an organization— not just superstar analysts—to investigate data to find nuances, trends, and outliers in a flash.
Tableau工具让整个公司的员工 – 不仅仅是那些超级明星分析师 – 能够探索数据,以发现数据的细微差别,趋势和异常值。
(Yes, the superstars benefit, too.)
(是的,那些超级明星分析师也能获利。)
No longer constrained to a million rows of spreadsheet data or a monthly report that only answers a few questions, people can now interact and visualize data, asking—and answering—questions at the speed of thought.
不再局限于百万行表格数据或仅能回答几个问题的月度报告,人们现在可以与数据交互,将数据可视化,迅速的回复疑问。
Using an intuitive, drag-and-drop approach to data exploration means spending time thinking about what your data is telling you, not creating a mountain of pivot tables or filling out report requests.
以直观的拖放方式去进行数据探索,给了我们更多的时间去思考数据的本质,而不是花时间去创建一大堆数据透视表或者填写报告需求表。
(Tableau2)
Tableau
We help people make faster, better business decisions, empowering them with self-service tools to explore data and share insights in minutes….
我们帮助人们更快,更好的做出业务决策,给人们提供自助服务工具去很快的探索数据,分享洞察。
Simple drag-and-drop tools are paired with intuitive visualizations.
简单的拖放工具与直观的可视化是一对组合。
Connect to any data source and share your insights in minutes….
他们可以帮你连接到任何数据源并在几分钟内让你可以分享洞察。
Standalone data discovery tools will only get you so far.
独立的数据发现工具只能带你走到这一步。
Step into enterprise-ready analytics and guarantee secure, governed data discovery.
迈入企业级分析,确保数据发现安全且受管控。
(MicroStrategy3)
(MicroStrategy)
Regardless of the speed and agility of one technology or another, it all depends on the person driving the system to ask really good questions.
无论这种或那种技术的速度和灵活性如何,一切都取决于操作系统的人能否问出真正的好问题。
However, if the system does not have really good data, even the best questions will result in faulty insights.
但是,如果系统没有拥有好的数据,即使是最好的问题也会得出错误的见解。
Therefore, data hygiene takes precedence over superior query capability.
因此,数据清洁高于优秀的查询能力。
Data Hygiene
数据清洁
Garbage in, garbage out.
输入垃圾,就会输出垃圾。
So much goes into Big Data, it’s very hard to know which bits are worthy of being included and which need to be rectified.
进入大数据的东西如此之多,很难知道哪些部分值得纳入,哪些需要修正。
For that, you need a subject matter expert and a data matter expert.
为此,你需要一名主题专家和一名数据专家。
A data matter expert is knowledgeable about a specific stream: how it was collected, how it was cleansed, sampled, aggregated and segmented, and what transformation is required before blending it with other streams.
数据专家了解某一特定数据流:它是如何收集的,如何清理、采样、聚合和分段的,以及在与其他数据流混合之前需要进行哪些转换。
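As a rough illustration of the checks a data matter expert might run before blending one stream with another, the sketch below flags missing fields, out-of-range values, and exact duplicate rows. The function name hygiene_report, the field names, and the allowed range are all assumptions made for this example, not anything prescribed by the text.
下面的草图粗略演示了数据专家在混合数据流之前可能进行的检查:标记缺失字段、超出范围的值和完全重复的行。函数名hygiene_report、字段名和允许范围都是为本例假设的,并非原文规定的内容。

```python
def hygiene_report(rows, required, numeric_ranges):
    """Return a list of (row_index, problem) pairs for suspicious rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # A required field that is absent or empty.
        for field in required:
            if row.get(field) in (None, ""):
                issues.append((i, f"missing {field}"))
        # A numeric field outside its plausible range.
        for field, (low, high) in numeric_ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not low <= value <= high:
                issues.append((i, f"{field} out of range"))
        # An exact repeat of a row seen earlier.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            issues.append((i, "duplicate row"))
        seen.add(fingerprint)
    return issues


# Invented sample stream: one clean row and three kinds of garbage.
visits = [
    {"visitor_id": "a1", "page_views": 5},
    {"visitor_id": "", "page_views": 3},
    {"visitor_id": "a3", "page_views": -2},
    {"visitor_id": "a1", "page_views": 5},
]
problems = hygiene_report(
    visits,
    required=["visitor_id"],
    numeric_ranges={"page_views": (0, 10000)},
)
for index, problem in problems:
    print(f"row {index}: {problem}")
```

A report like this only says which rows look wrong; deciding whether to repair or drop them still takes someone who knows how the stream was collected.
这样的报告只能指出哪些行看起来有问题;至于是修复还是丢弃,仍然需要了解该数据流采集方式的人来判断。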
Data hygiene and data governance are paramount to ensure the digital analytics cooks are using the very best ingredients to avoid ruining a time-proven recipe.
数据清洁和数据治理至关重要,它们确保数字分析的“厨师”使用最好的食材,以免毁掉一份久经考验的配方。
Further, when the output of one analysis provides the input for the next (creating a dashboard, for example), transformation, aggregation and segmentation help obfuscate the true flavor of the raw material until it is past the ability of a forensic data scientist to track down the cause of any problems—supposing somebody is aware that there is a problem.
此外,当一个分析的输出成为下一个分析的输入(例如创建仪表盘)时,转换、聚合和分段会掩盖原材料的真实风味,直到连取证式的数据科学家也无法追查出问题的根源,这还得假设有人意识到存在问题。
Yet, aggregations are as important to the insight supply chain as top-grade ingredients are to the five-star chef:
然而,聚合对于洞察供应链的重要性,正如顶级食材对于五星级厨师一样:
[D]ata aggregations and summaries remain critical for supporting visual reporting and analytics so that users can see specific time periods and frame other areas of interest without getting overwhelmed by the data deluge.
数据聚合和概括对于支持可视化报告和可视化分析仍然至关重要,这样用户就可以关注特定的时间段并探索其他感兴趣的领域,而不会被数据海洋所淹没。
Along with providing access to Hadoop files, many modern visual reporting and data discovery tools enable users to create aggregations as the need arises rather than having to suffer the delays of requisitioning them ahead of time from IT developers.
除了可以对Hadoop文件的访问之外,许多现代的可视化报告和数据发现工具可以让用户能够只在需要时进行聚合,而不必承受因需要提前向IT开发人员打申请报告造成的延迟。
In a number of leading tools, this is accomplished through an integrated in-memory data store where the aggregations are done on the fly from detailed data stored in memory.
在许多先进的工具中,这个功能是通过集成的内存数据存储来实现的,聚合过程利用内存中数据完成的。
TDWI Research finds that enterprise data warehouses, BI reporting and OLAP cubes, spreadsheets, and analytic databases are the most important data sources for visual analysis and data discovery, according to survey respondents.
根据调查中受访者的回答,TDWI Research发现,企业数据仓库,BI报告和OLAP多维数据集,电子表格和分析数据库是视觉分析和数据发现最重要的数据来源。
(TDWI4)
(TDWI)
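The idea in the passage above of building aggregations as the need arises from detailed data held in memory can be sketched in a few lines. This is only a toy illustration: the aggregate function and the order records are invented for the example and do not describe how any particular vendor's in-memory store works.
上面引文中“按需从内存中的明细数据生成聚合”的思路,可以用几行代码粗略示意。这只是一个玩具示例:aggregate函数和订单记录都是为本例虚构的,并不代表任何厂商内存数据存储的实际实现。

```python
from collections import defaultdict


def aggregate(rows, key, measure):
    """Sum `measure` for each distinct value of `key`, computed on demand."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


# Detailed rows kept in memory (invented sample data).
orders = [
    {"month": "2015-01", "channel": "search", "revenue": 120.0},
    {"month": "2015-01", "channel": "social", "revenue": 80.0},
    {"month": "2015-02", "channel": "search", "revenue": 150.0},
    {"month": "2015-02", "channel": "social", "revenue": 50.0},
]

# The same detail answers two different questions without a new report request.
by_month = aggregate(orders, "month", "revenue")
by_channel = aggregate(orders, "channel", "revenue")
print(by_month)
print(by_channel)
```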
The care and feeding of the raw material used in the data discovery process is even more important in light of the lack of five-star chefs.
鉴于缺乏五星级厨师,在数据发现过程中,仔细的处理原始数据更为重要。
As analytics becomes more accepted, demanded and democratized, more and more amateur analysts will be deriving conclusions from raw material they trust implicitly rather than understand thoroughly.
随着分析变得越来越被大众接受,所需求并变得大众化,越来越多的业余分析师将从他们信任的原始数据中得到结论,而不需要完整的了解数据。
Preparing for data illiterate explorers requires even more rigor than usual to guard against their impulse to jump to the wrong conclusions.
为不了解数据的分析员提供工具需要比平常更严格,来防止他们鲁莽的得出错误结论。
Asking Really Good Questions
问优质的问题
In the hands of a well-informed analyst, lots of data and heavy-lifting analytics tools are very powerful.
在见多识广的分析师手中,大量的数据和强大的分析工具威力十足。
Getting the most out of this combination takes a little bit of creativity.
充分利用这样的组合工具需要一点创造力。
Creativity means broadening your mental scope.
创造力意味着拓宽你的思维视野。
Rather than seeking a specific answer, open yourself up to possibilities.
与其寻求某个具体的答案,不如向各种可能性敞开自己。
It’s like focusing on your peripheral vision.
这就像把注意力放在你的余光上。
Appreciate Anomalies
重视异常值
Whether you use visualization tools and “look for” things that go bump in the night, or you are adept at scanning a sea of numbers and wondering why it looks out of balance, the skill to hone is the art of seeing the out-of-the-ordinary.
无论你是借助可视化工具去“寻找”暗夜里的异动,还是擅长扫视一片数字的海洋并琢磨它为什么看起来失衡,需要磨练的技能都是发现异常之处的艺术。
Outliers, spikes, troughs—any anomaly—are our friends.
异常值,尖峰,低谷 – 任何异常 – 都是我们的朋友。
They draw our attention to that which is not like the others and spark the intellectual exercise of wondering “Why?”
它们把我们的注意力引向与众不同的那一个,并引发我们思考“为什么?”
What is it about this element that makes it point in a different direction?
这个因素到底是什么?它为什么指向另一个方向。
Could it be some error in the collection or transformation of the underlying data?
是我们在收集或转换基础数据时出现了一些错误吗?
Is it a function of how the report was written or the query was structured?
问题是出在报告编写的方式上或查询构造上吗?
Or does it represent some new behavior/market movement/customer trend?
或者它代表了一些新的行为/市场变动/客户趋势?
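One simple way to surface the spikes and troughs described above is a z-score check, sketched here. The threshold of two standard deviations and the daily visit counts are assumptions chosen for illustration; the text does not prescribe any particular method.
要找出上文所说的尖峰和低谷,一种简单的办法是做z分数检查,如下所示。两个标准差的阈值和每日访问量数据都是为演示而假设的,原文并未指定任何具体方法。

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=2.0):
    """Return (index, value, z_score) for points far from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


# Invented daily visit counts with one obvious spike.
daily_visits = [120, 118, 125, 122, 119, 121, 310, 117, 123, 120]
standouts = flag_anomalies(daily_visits)
for index, value, z in standouts:
    print(f"day {index}: {value} visits (z = {z:.1f})")
```

The check only points at the standout. Whether it is a collection error, a quirk of the query, or a real change in behavior is still the analyst's question to answer.
这种检查只能指出异常点。它究竟是采集错误、查询方式造成的假象,还是真实的行为变化,仍然要由分析师来回答。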
It is in the hunt for the truth about these standouts that we trip over the serendipitous component that spawns a new question and another dive down the rabbit hole.
正是在追查这些异常背后真相的过程中,我们偶然撞见了意外的线索,它催生出新的问题,引我们再一次钻进兔子洞。
The secret is knowing when to stop.
秘诀是知道何时该停止。
One can easily get lost in a hyperlink-chasing “research session” and burn hours with very little to show for it.
人们很容易迷失在一场追逐超链接的“研究”之中,耗掉几个小时却几乎一无所获。
Following the scent of significance is an art and one that takes practice and discipline.
追随重要线索的气味是一门艺术,需要练习和自律。
Many scientists spend a career pursuing a specific outcome only to find it disproved.
许多科学家穷极一生去证明某个结论,最终却发现它是错的。
Others stop just short of a discovery because they lose heart.
另一些人则因为灰心丧气,在距离发现仅一步之遥的地方停了下来。
The magic happens between those two points.
只有在两个极端之间,奇迹才会发生。
Give in to the temptation to slice the data one more time or to cross reference results against just one more query, but be vigilant that you are not wasting valuable cycles on diminishing returns.
我知道你还想继续对数据进行切分或者想再进行一次交叉查询,但要警惕的是,不要在边际收益递减时继续花费您宝贵的时间。
If you don’t see what you expect to see, work your hardest to understand why.
如果您没有看到您期望看到的结果,请花些时间去问问为什么。
It may be that you do not have enough facts.
可能是你掌握的事实还不够。
It might be that you have already, unknowingly, come to a conclusion or formed a pet theory without all the facts.
可能是你在尚未掌握全部事实的情况下,已经不知不觉地得出了结论,或形成了自己偏爱的理论。
It might be—and this is the most likely—that there is something afoot which you have not yet considered.
可能是 – 而且这是最有可能的 – 你遗漏了一些东西。
Dig deeper.
再深挖一下。
Ask, “I wonder . . .”
问一问:“我想知道……”
And be cognizant of that which is conspicuous in its absence.
还要留意那些因为缺席而格外显眼的事物。
Gregory (Scotland Yard detective):
格雷戈里(伦敦警察厅侦探):
“Is there any other point to which you would wish to draw my attention?”
“你还有什么其他线索可以引起我的注意吗?”
Holmes:
福尔摩斯:
“To the curious incident of the dog in the night-time.”
“请留意那晚狗的奇怪反应。”
Gregory:
格雷戈里:
“The dog did nothing in the night-time.”
“狗在夜间什么也没做。”
Holmes:
福尔摩斯:
“That was the curious incident.”
“这本身就是一个奇怪的事情。”
Sir Arthur Conan Doyle, Silver Blaze
阿瑟·柯南·道尔爵士,《银色马》
As a corollary, be wary of the homologous as well:
由此推论,同样要警惕同源(homologous)的事物:
Exhibiting a degree of correspondence or similarity.
1.表现出一定程度的协同性或相似性。
Corresponding in structure and evolutionary origin, but not necessarily in function.
2.在结构和进化起源上类似,而不一定在功能上。
For example, human arm, dog foreleg, bird wing, and whale flipper are homologous.
例如,人类的手臂、狗的前腿、鸟的翅膀和鲸鱼的鳍状肢是同源的。
(A Word A Day5)
(A Word A Day)
Things that are unusually similar are equally cause for alarm as standouts.
异常相似的事物与突出的异常值一样,同样值得警惕。
If everybody in your cohort looks the same, there’s something funny going on and it’s worth an investigation.
如果你数据中的每个人都看起来都一样,那么就值得去寻找一下原因。
It may be that their similarity is a statistical anomaly.
产生这些相似性的原因可能是统计异常。
Savor Segmentation
品味细分
People (thank heaven!) are different.
人们(谢天谢地!)是不同的。
We make a huge mistake when we lump them all together.
如果当我们将所有的人混为一谈时,我们就是犯了一个巨大的错误。
But we cannot treat them as individuals—yet.
但我们还无法把他们当作一个个个体来对待,至少目前还不行。
Peppers and Rogers’ One to One Future is not yet upon us.
Peppers和Rogers所说的“一对一的未来”(One to One Future)尚未到来。
In between lies segmentation.
两者之间就是细分。
It almost doesn’t matter how you segment your customers (geographically, chronologically, by hair color).
你如何细分客户几乎无关紧要(按地理位置、按时间顺序、按头发颜色都行)。
Eventually, you will find traits that are useful in finding a cluster of behavior that can be leveraged to your advantage.
最终,你总会找到一些特征,它们能帮你发现某一类可以为你所用的行为群体。
People who come to our website in the morning are more likely to X.
早上访问我们网站的人更有可能做X。
People who complain about us on social media respond better to message Y.
在社交媒体上抱怨我们的人对消息Y的反应更好。
People who use our mobile app more than twice a week are more likely to Z.
每周使用我们的移动应用超过两次的人更有可能做Z。
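A statement such as "people who come in the morning are more likely to X" boils down to comparing a rate across segments. The sketch below does that for a time-of-day split. The visit records, the noon cutoff, and the names conversion_by_segment and time_of_day are hypothetical, made up for this example.
“早上来的人更有可能做X”这类结论,归根结底是比较不同细分群体的比率。下面的草图按一天中的时段做了这样的比较。访问记录、以中午为界的划分方式,以及conversion_by_segment和time_of_day这两个名称都是为本例虚构的。

```python
from collections import defaultdict


def conversion_by_segment(visits, segment_of):
    """Group visits with `segment_of` and return the conversion rate per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [visits, conversions]
    for visit in visits:
        segment = segment_of(visit)
        counts[segment][0] += 1
        if visit["converted"]:
            counts[segment][1] += 1
    return {segment: won / total for segment, (total, won) in counts.items()}


def time_of_day(visit):
    """One arbitrary way to segment: before noon versus the rest of the day."""
    return "morning" if visit["hour"] < 12 else "later"


# Invented visit log.
visit_log = [
    {"hour": 8, "converted": True},
    {"hour": 9, "converted": True},
    {"hour": 10, "converted": False},
    {"hour": 11, "converted": True},
    {"hour": 14, "converted": False},
    {"hour": 16, "converted": True},
    {"hour": 20, "converted": False},
    {"hour": 22, "converted": False},
]
rates = conversion_by_segment(visit_log, time_of_day)
print(rates)
```

Because the segmenting rule is passed in as a function, trying geography, device, or any other trait means swapping one small function rather than rewriting the analysis.
由于细分规则是以函数的形式传入的,要改用地理位置、设备或其他任何特征,只需替换一个小函数,而不必重写整个分析。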
When it comes to segmenting customers by behavior, Bernard Berelson pretty much nailed it in his “Human Behavior: An Inventory of Scientific Findings”6 where he said:
谈到按行为细分客户,Bernard Berelson在他的著作《人类行为:科学发现汇总》中说得相当到位:
Some do and some don’t.
有些人会,有些人不会。
The differences aren’t that great.
差异并不是那么大。
It’s more complicated than that.
实际情况比这更复杂。
When you’re trying to get the right message in front of the right people at the right time and on the right device, segmentation may likely be the key to the mystery.
当你试图在正确的时间、在正确的设备上把正确的信息呈现给正确的人时,细分很可能就是解开谜团的关键。
Don’t Fool Yourself
不要欺骗自己
While working with data is reassuring—we are, after all, dealing with facts and not opinions—we are still human and still faced with serious mental handicaps.
虽然与数据打交道令人安心(毕竟我们面对的是事实而不是意见),但我们仍然是人,仍然面临着严重的思维障碍。
Being open-minded and objective are wonderful goals, but they are not absolute.
保持开放的心态和客观是很好的目标,但它们并不是绝对的。
Cognitive biases are inherited, taught, and picked up by osmosis in a given culture.
认知偏差在特定的文化中通过遗传、教导和耳濡目染而形成。
In short, your mind can play tricks on you.
简而言之,你的大脑会捉弄你。
While this is too large a subject to cover in depth here, there are some examples that make it clear just how tenuous your relationship with “the facts” might be.
虽然这是一个非常广泛的主题,这里很难深入的讨论,不过这里还是有一些例子,可以清楚地表明你认为的“事实”可能有多么脆弱。
Familiarity Bias
熟悉偏差
I’ve worked in television advertising all my life and I can tell you without any doubt that it’s the most powerful branding medium there is.
我一生都在电视广告行业,我可以毫无疑问地告诉你,电视广告是最强大的品牌传播媒介。
Hindsight or Outcome Bias
后见之明或结果偏差
If they’d only have asked me, I would have told them that the blue button would not convert as well as the red one.
如果他们当初问了我,我会告诉他们蓝色按钮的转化效果不如红色按钮。
It was obvious all along.
这是很明显的。
Attribution Bias
归因偏差
Of course I should have turned left at that light.
当然我应该在那个灯那儿左转。
But I was distracted by the sun in my eyes and the phone ringing.
但我的眼睛被太阳照到了,而且电话铃声分散了我的注意力。
That other guy missed the turn because he’s a dim-wit.
另一个人错过了转弯,因为他是一个笨蛋。
Representativeness Bias
代表性偏差
Everybody who clicks on that link must be like everybody else who clicked on that link in the past.
点击该链接的每个人的表现肯定和过去点击这个链接的人一样。
Anchoring Bias
锚定偏差
That’s far too much to pay for this item.
这个东西的价格太高了。
The one next to it is half the price.
旁边的那个是只是它的一半价格。
Availability Bias (the first example that comes to mind)
可得性偏差(过分依赖第一个出现在脑中的例子)
That’ll never work—let me tell you what happened to my brother-in-law . . .
这永远不会奏效,让我告诉你我姐夫遇到了什么事……
Bandwagon Bias
随波逐流的偏差
We should run a Snapchat campaign because everybody else is doing it.
我们应该开展一个Snapchat推广活动,因为其他人都在这样做。
Confirmation Bias
确认偏差
I’m a conservative, so I only watch Fox News.
我是保守派,所以我只看福克斯新闻。
I’m a liberal, so I only watch The Rachel Maddow Show.
我是一个自由主义者,所以我只看The Rachel Maddow Show。
I’ve been in advertising all my life, so I count on Nielsen, Hitwise, and comScore.
我一生都在做广告,所以我信赖尼尔森,Hitwise公司和comScore。
I started out grepping log files, so I only trust my Coremetrics/Omniture/Webtrends numbers.
我是从用grep查日志文件起步的,所以我只相信Coremetrics/Omniture/Webtrends的数字。
Projection Bias
映射偏差
I would never click on a product demo without a long list of testimonials, so we can assume that’s true of everybody else.
如果没有长长的推荐列表,我永远不会点击产品演示,我们可以假设其他人也会这么做。
Expectancy Bias
期望偏差
Your report must be wrong because it does not show the results I anticipated.
您的报告肯定是错误的,因为它不是我预期的结果。
Normalcy Bias
常态偏差
Back-ups?
备份?
We’ve never had a data loss problem yet, I don’t see it happening this quarter so we won’t have to budget for it.
我们从来没有遇到过数据丢失问题,我不认为这个季度会发生这种情况,所以我们不必为此准备预算。
Semmelweis Reflex
塞默尔维斯反射(墨守成规)
I don’t care what your numbers say, we’ve always had better conversions from search than social media so we’re not going to change our investment.
我不在乎数据说什么,我们从搜索获得的转化优于社交媒体,所以我们不会改变我们的预算安排。
If any of the above sound familiar, congratulations—you’ve been paying attention.
如果上述任何一条听起来很熟悉,恭喜你,说明你一直在留心观察。
The hard part is convincing others that there may be a cognitive problem.
现实中很难让别人也意识到这些偏差的存在。
Correlation versus Causation
相关关系与因果关系
While frequently mentioned, it cannot be stressed enough that just because drownings go up when ice cream sales go up, one did not cause the other.
我们经常提到的例子,再强调也不过分:虽然当冰淇淋销售额上升时溺水事件也上升,但前者并不是后者发生的原因。
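The ice cream example can be made concrete with a small calculation. In the sketch below, a third variable (temperature) drives both series; the raw correlation between sales and drownings is strong, yet it vanishes once temperature is held constant. The technique used is a partial correlation computed from regression residuals, which is this example's choice rather than anything the text specifies, and every number is invented so that the effect is easy to see.
冰淇淋的例子可以用一个小计算来具体说明。在下面的草图中,第三个变量(气温)同时驱动两组数据;销售额与溺水事件之间的原始相关性很强,但一旦控制了气温,相关性就消失了。这里用的是基于回归残差的偏相关,这是本例自行选用的方法,并非原文指定;所有数字都是为了让效果一目了然而虚构的。

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Invented data: temperature drives both series; the wiggles are unrelated noise.
temps = [15, 18, 21, 24, 27, 30, 33, 36]
wiggle_a = [1, -1, -1, 1, 1, -1, -1, 1]
wiggle_b = [1, -1, -1, 1, -1, 1, 1, -1]
ice_cream_sales = [10 * t + 5 * a for t, a in zip(temps, wiggle_a)]
drownings = [0.2 * t + 0.5 * b for t, b in zip(temps, wiggle_b)]

raw = pearson(ice_cream_sales, drownings)
controlled = pearson(residuals(ice_cream_sales, temps), residuals(drownings, temps))
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {controlled:.2f}")
```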
Most recently, a Swedish study (“Allergy in Children in Hand Versus Machine Dishwashing”7) concluded,
最近,一项瑞典研究(“手洗餐具与洗碗机洗餐具家庭中的儿童过敏”)得出结论:
“In families who use hand dishwashing, allergic diseases in children are less common than in children from families who use machine dishwashing,” and speculated that,
“在使用手洗餐具的家庭中,儿童的过敏性疾病比使用洗碗机的家庭少,”并推测,
“a less-efficient dishwashing method may induce tolerance via increased microbial exposure.”
“效率较低的洗碗方法可能会通过增加微生物暴露水平来产生过敏耐受性。”
While the study asked a great deal of questions about the types of food they eat, food preparation, parental smoking, etc., there are simply too many other variables at play for this cause to be solely responsible for that effect.
虽然这项研究询问了很多问题,涉及他们所吃的食物类型、食物的准备方式、父母是否吸烟等等,但起作用的其他变量实在太多,不能认定这一个原因单独造成了那个结果。
How many other similarities are there among families that have dishwashers vs. those that do not?
有洗碗机的家庭与没有洗碗机的家庭有多少其他相似之处呢?
Correlations are a wonderful clue, but they must be treated as clues and not results.
相关性是一个很好的线索,但它们只能被视为线索,而不是结果。
Correlations are the stimulus for seeking a cause, not the end of the story.
相关性促使我们去寻找故事后面的原因,它本身并不是故事的结局。
Communicating Carefully
仔细沟通
Coming up with a fascinating correlation and proving a causative relationship can be exciting.
提出一个吸引人的相关性,并证明一个因果关系是令人兴奋的。
The thrill of the chase, the disappointment of a miscalculation, and the redemption of the correction make for an invigorating career, but like your latest round of golf, not necessarily a great story at the dinner table.
奋斗中的快感,计算错误带来的失望以及纠正错误的兴奋都让你的事业充满活力,但就像你最近打的一轮高尔夫,这些过程并不是可以在餐桌上高谈阔论的精彩故事。
And certainly not at the conference room table or across the desk from an executive who is trying to make a multimillion-dollar advertising decision.
在会议室的桌边,或是面对一位正要做出数百万美元广告决策的高管时,就更不是了。
This is the time to stick with what you know, not how you got there.
现在是时候关注你所知道的东西,而不是你如何到达那里的过程了。
The most important part of your performance when delivering insights based on data is to avoid any bravado of certainty.
基于数据提供洞察的时候,最重要的是不要虚张声势。
You have not been asked to audit the books and come to a conclusion.
你并没有被要求去审计账目并得出结论。
You have not been tasked with adding up a row of numbers and delivering The Answer.
你也没有被要求把一列数字加总,然后交出“标准答案”。
Instead, you have been asked to sift through the data to see if there’s anything in there that might be directional.
相反,你被要求去筛查数据,去看看能否从数据中得到一些指导性的东西。
To assure everybody else that you understand your responsibility and to appropriately frame your findings in terms that will lead to a valuable conversation and business decision, monitor your language carefully.
为了让所有人知道你了解你自己的责任,为了以恰当的方式组织你的发现,以便其他人可以进行有价值的沟通,做出有价值的商业决策,请注意你的沟通语言。
The data suggests . . .
数据表明……
It seems more likely . . .
似乎更有可能……
One could conclude . . .
可以得出这样的结论……
Based on the data, it feels like . . .
根据数据,感觉像是……
If I were placing bets after seeing this . . .
如果让我在看到这些之后下注……
Remember that you are looking into a crystal ball that is a complete mystery to the business side of the house and you are telling them things about a subject they know very well, just not through that lens.
请记住,你正在凝视的这个水晶球,对公司的业务部门来说完全是个谜;你告诉他们的,是他们非常熟悉的主题,只是他们从未透过这个镜头去看。
They know advertising and marketing inside and out and are going to be incredulous if you make pronouncements that are contrary to their experience, gut feel, and common sense.
他们对广告和营销了如指掌,如果你的论断与他们的经验、直觉和常识相悖,他们会难以置信。
The domain expert can look at a carefully scrutinized, statistical revelation and roll their eyes.
领域专家看到一项经过仔细审查的统计发现时,可能只会翻个白眼。
“Of course movies starting with the letter A are more popular—we list them alphabetically.”
“当然,以字母A开头的电影更受欢迎,因为我们是按字母顺序排列的。”
“Of course online sales took a jump that week in that region—there was a five day blizzard.”
“当然,那一周该地区的在线销售额猛增,因为那里下了五天的暴风雪。”
“Of course we sold more low-end laptops that day—our competitor’s website was down.”
“当然,我们那天卖出了更多的低端笔记本电脑,因为竞争对手的网站宕机了。”
Be sure to sound more like the weather prognosticator who talks about a chance of showers.
要保证听起来更像天气预报员谈论未来会下阵雨的机会那样。
Use the vernacular of the gambler running the odds.
使用赌徒计算赔率时的行话。
Think in terms of a Probability Line [Figure 1.4] and choose your words accordingly.
参照概率线[图1.4]来思考,并据此选择措辞。
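Choosing words from a probability line can be reduced to a small lookup, sketched below. The five labels follow the probability line in Figure 1.4; the numeric cutoffs between them and the function name probability_word are assumptions made for this example.
按照概率线来选择措辞,可以简化为一个小小的查找函数,如下所示。五个标签取自图1.4的概率线;标签之间的数值分界点以及函数名probability_word都是为本例假设的。

```python
def probability_word(p):
    """Map a probability between 0 and 1 to a hedged everyday phrase."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p == 0.0:
        return "impossible"
    if p == 1.0:
        return "certain"
    if p < 0.45:
        return "unlikely"
    if p <= 0.55:
        return "an even chance"
    return "likely"


# The two chances marked on the probability line, plus a coin flip.
for label, p in [("1-in-6", 1 / 6), ("coin flip", 0.5), ("4-in-5", 4 / 5)]:
    print(f"{label}: {probability_word(p)}")
```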
Follow the lead of doctors who talk about relative health risks.
就像医生谈论某个疾病的可能风险时那样开始对话。
And then, draw them into the supposition process.
然后,将他们引入到假设的情境。
Doesn’t that seem logical?
这看起来合乎逻辑吗?
Does that meet or challenge your thoughts?
这与你的想法冲突还是契合?
Do you think it means this or that?
你觉得这意味着这样还是那样?
It shouldn’t take long to get them to see you as an advisor and not a report writer.
过不了多久,他们就会更乐意把你当成顾问而不是报告撰写者。
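Figure 1.4 The spectrum of probability (Math Is Fun31). [Probability line running from Impossible through Unlikely, Even Chance, and Likely to Certain, with a 1-in-6 chance and a 4-in-5 chance marked.]
图1.4 概率谱(Math Is Fun)。[概率线从“不可能”经“不太可能”“机会均等”“很可能”到“必然”,线上标出了六分之一的机会和五分之四的机会。]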
Become a Change Agent
成为变革的推动者
The very best way to win the hearts and minds of those who can benefit the most by your flair for data discovery is to educate them.
那些从你的数据发现天赋中获益最多的人,赢得他们同意和赞许的最好的办法就是去教导他们。
The more people in your organization who understand the ways and means of data exploration as well as the associated risks and rewards, the more they will come to you for answers, include you in planning sessions and support your calls for more data, people and tools.
你的组织中了解数据探索的方法和手段及其相关风险和回报的人越多,他们就越会来找你寻求答案,邀请你参加规划会议,并支持你对更多数据、人手和工具的要求。
Start by inviting them to lunch.
首先邀请他们共进午餐。
Ask them to bring their best questions about The Data.
请他们尽情的提出数据有关的疑问。
Encourage those who would rather not be seen as ill-informed to submit their questions in advance.
鼓励那些不愿被视为孤陋寡闻的人提前提交问题。
Prepare a handful of questions that you wish they would ask.
准备一些你希望他们会问到的问题。
Answer their questions.
回答他们的问题。
Show them examples of quick-wins enjoyed by other projects in other departments.
向他们展示其他部门的项目所获得的快速获益的例子。
Share case studies from vendors about successes at other companies.
分享供应商那儿其他公司成功案例的研究报告。
Engage your audience in the excitement of the chase with a simple data set and a common challenge.
通过简单的数据集和一些共同的问题挑战,让您的观众激情踊跃参与。
If you can teach them how to ask great questions by example and by exercise then you can change how they approach data—to see it as a tool instead of an accusation.
如果你能通过示例和练习教会他们如何提出好问题,你就能改变他们对待数据的方式:把数据视为工具,而不是一种指责。
And be sure to feed them.
还有,一定要让他们吃饱。
This is a case where a free lunch will pay off handsomely.
在这件事上,一顿免费的午餐会带来丰厚的回报。
Your Job as Translator
你的工作就像翻译
You know your data inside and out, but the consumers of your insights, who must depend on your recommendations, do not.
您了解数据的里里外外,但是依赖你提供数据洞察的用户却不了解。
To them, your data is as readable as a crystal ball or a sequence of Tarot cards.
对他们而言,你的数据就像水晶球或一组塔罗牌一样难以读懂。
That means they are putting their trust in you.
这意味着他们信任你。
Therefore, your responsibility is to inform without confusing, to encourage without mystifying and to reassure without resorting to sleight of hand.
因此,你的责任是:告知而不令人困惑,鼓励而不故弄玄虚,让人安心而不耍花招。
Entice and enthrall your Truth Seekers with The Data and The Correlations, but make sure your confidence levels are high and be prepared to show your work.
用数据和相关关系去吸引并迷住你的真相探求者,但要确保你的置信水平足够高,并随时准备展示你的推导过程。
Conclusion
结论
Successful data discovery requires good tools (technology) and trustworthy raw material (clean data), but depends on the creativity of the data detective.
成功的数据发现需要很好的工具(技术)和值得信赖的原材料(干净的数据),同时也依赖于数据侦探的创造力。
The best analyst has the ability to manipulate data in a variety of ways to tease out relevant insights.
最优秀的分析师能够以各种方式处理数据,从中梳理出相关的洞察。
With the goals of the organization firmly in mind, top analysts engage the data in a conversation of What-Ifs, resulting in tangible insights that can be used to make decisions by those in charge.
顶级分析师把组织的目标牢记在心,与数据展开一场“假如……会怎样”的对话,从而得到可供决策者使用的具体洞察。
The analyst, as consulting detective, becomes indispensable.
分析师,就是咨询侦探,变得不可或缺。
NOTES
备注:
1. Self-Service Data Discovery and Visualization Application, Sense BI Tool | Qlik, available at http://www.qlik.com/us/explore/products/sense, last accessed on 3/13/15.
自助式数据发现和可视化应用程序,Sense BI Tool | Qlik,链接:http://www.qlik.com/us/explore/products/sense,最后访问于2015年3月13日。
2. Data Discovery | Tableau Software, available at http://www.tableau.com/solutions/data-discovery, last accessed on 3/13/15.
数据发现 | Tableau Software,链接:http://www.tableau.com/solutions/data-discovery,最后访问于2015年3月13日。
3. Features of the Analytics Platform | MicroStrategy, available at http://www.microstrategy.com/us/analytics/features, last accessed on 3/13/15.
分析平台的功能 | MicroStrategy,链接:http://www.microstrategy.com/us/analytics/features,最后访问于2015年3月13日。
4. Data Visualization and Discovery for Better Business Decisions, available at http://www.adaptiveinsights.com/uploads/news/id421/tdwi_data_visualization_discovery_better_business_decisions_adaptive_insights.pdf, last accessed on 3/13/15.
数据可视化和发现,以实现更好的业务决策,链接:http://www.adaptiveinsights.com/uploads/news/id421/tdwi_data_visualization_discovery_better_business_decisions_adaptive_insights.pdf,最后访问于2015年3月13日。
5. A.Word.A.Day—homologous, available at http://wordsmith.org/words/homologous.html, last accessed on 3/13/15.
A.Word.A.Day:homologous(同源),链接:http://wordsmith.org/words/homologous.html,最后访问于2015年3月13日。
6. Human Behavior: An Inventory of Scientific Findings, available at http://home.uchicago.edu/aabbott/barbpapers/barbhuman.pdf, last accessed on 3/13/15.
人类行为:科学发现汇总,链接:http://home.uchicago.edu/aabbott/barbpapers/barbhuman.pdf,最后访问于2015年3月13日。
7. Allergy in Children in Hand Versus Machine Dishwashing, available at http://pediatrics.aappublications.org/content/early/2015/02/17/peds.2014-2968.full.pdf, last accessed on 3/13/15.
手洗餐具与洗碗机洗餐具家庭中的儿童过敏,链接:http://pediatrics.aappublications.org/content/early/2015/02/17/peds.2014-2968.full.pdf,最后访问于2015年3月13日。
Variety of Data Is the Spice of Life
数据的多样性是生活的调味品
Machine learning differs from data diving.
机器学习不同于数据探索(data diving)。
It is like putting tens of thousands of statisticians in a black box and throwing in a question.
这就像将数以万计的统计学家放入一个黑匣子,再抛进去一个问题。
They will scour through the data in different ways, confer, and then pop out an answer along with their degree of confidence.
他们会以不同的方式梳理数据,相互商议,然后给出一个答案,并附上他们的置信程度。
Next, they will test their answer against some fresh information and adjust their opinion.
接下来,他们将根据一些新的信息检验他们的答案,并调整他们的结果。
The more data you let them look at, and the more they cycle their assumptions against real-world results, the better.
您给他们的数据越多,他们将假设与实际结果的验证次数越多,结果也越好。
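The loop described above (give an answer with a degree of confidence, test it against fresh data, adjust) can be shown in miniature. The sketch below swaps in a deliberately simple stand-in, a Beta-Bernoulli update of an estimated click rate, which is far humbler than the machine learning the text has in mind. The class name RateBelief and the click counts are invented for the example.
上面描述的循环(给出带有置信程度的答案,用新数据检验,再作调整)可以用一个微缩示例来展示。下面的草图有意换用了一个非常简单的替代品,即对点击率估计做Beta-Bernoulli更新,它远比原文所说的机器学习要简陋。类名RateBelief和点击数据都是为本例虚构的。

```python
from math import sqrt


class RateBelief:
    """Running belief about a click rate, held as a Beta distribution."""

    def __init__(self):
        # Beta(1, 1) prior: before any data, every rate is equally plausible.
        self.hits = 1
        self.misses = 1

    def update(self, clicked):
        """Fold one fresh observation into the belief."""
        if clicked:
            self.hits += 1
        else:
            self.misses += 1

    def estimate(self):
        """Current best guess: the mean of the Beta distribution."""
        return self.hits / (self.hits + self.misses)

    def uncertainty(self):
        """Standard deviation of the Beta distribution; shrinks as data accumulates."""
        a, b = self.hits, self.misses
        n = a + b
        return sqrt(a * b / (n * n * (n + 1)))


belief = RateBelief()

# First batch: 10 impressions, 3 clicks (invented numbers).
for clicked in [True] * 3 + [False] * 7:
    belief.update(clicked)
early_estimate, early_spread = belief.estimate(), belief.uncertainty()

# Second batch: 90 more impressions, 27 clicks.
for clicked in [True] * 27 + [False] * 63:
    belief.update(clicked)
late_estimate, late_spread = belief.estimate(), belief.uncertainty()

print(f"after 10:  {early_estimate:.3f} +/- {early_spread:.3f}")
print(f"after 100: {late_estimate:.3f} +/- {late_spread:.3f}")
```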
With the price of storage in a downward spiral to almost nothing and the speed of processing continuing to increase thanks to parallel processing in the cloud, we can crunch through a great deal more information than ever.
随着存储数据的成本下降到很低的水平,云服务器的并行运算使得运算速度飞速提升,我们可以比以往掌握更多的数据。
Machine learning is good with lots of data, but it really goes to town when it has lots of different types of data to play with.
机器学习擅长处理大量数据,但当它有许多不同类型的数据可用时,才真正大显身手。
It can find correlations between attributes humans wouldn’t even consider comparing.
它可以发现相关关系,有些可能是人类根本不会想到的相关关系。
If there is a relationship among the weather, the color of socks a prospect is wearing, and what the prospect had for lunch, then marketers can leverage that correlation.
如果天气、潜在客户所穿袜子的颜色以及他午餐吃了什么之间存在关联,那么营销人员就可以利用这种相关性。
It doesn’t matter if the correlation is logical or even understandable, it only matters that it is actionable.
这种相关性是否在逻辑上合理并不重要,重要的是这种相关性有可执行性,可以产生收益。
In addition to all the digital interaction data that drove the whole Big-Data-Hadoop-Clusters-in-the-Cloud movement, now there’s even more data to chew on out there.
除了驱动整个大数据,hadoop,云计算运转的数字交互数据之外,现在还有更多的数据需要我们消化。
Open Data
公开数据
Hundreds of organizations, both governmental and NGOs, are publishing a shockingly large amount of data that might be useful in finding your next customer.
数以百计的政府和非政府组织正在发布大量可能有助于找到你的下一位客户的数据。
Just think about all the APIs (application programming interfaces) that allow you to grab onto firehoses like Facebook and Twitter.
想想所有那些API(应用程序编程接口),它们让你可以接入Facebook和Twitter这样的数据洪流。
Facebook Likes alone can predict quite a bit about you as an individual, according to a paper from the Psychometrics Centre, University of Cambridge.32 “Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes, including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender.”
根据剑桥大学心理测量中心(Psychometrics Centre)的一篇论文,仅凭Facebook的“赞”就能对你个人做出相当多的预测。“Facebook的赞可以用来自动且准确地预测一系列高度敏感的个人属性,包括:性取向、种族、宗教和政治观点、人格特质、智力、幸福感、成瘾物质的使用、父母离异、年龄和性别。”
Think about all the recipes you can get from Campbell’s Soup:33
想想你可以从Campbell’s Soup(金宝汤)获得的所有食谱:
The Campbell’s Kitchen API was developed to share information from Campbell’s Kitchen.
Campbell’s Kitchen 开发API就是为了分享Campbell’s Kitchen的信息。
This information includes thousands of recipes using brands like Campbell’s®, Swanson®, Pace®, Prego®, & Pepperidge Farm®—brands people love, trust, and use every day.
这些信息包含数千个食谱,用到了Campbell’s®,Swanson®,Pace®,Prego®和PepperidgeFarm®等品牌 – 这些人们每天都喜欢,信任和使用的品牌。
The easier people can find those recipes, the less time they have to spend worrying about what to make for dinner.
人们可以越容易地找到这些食谱,他们就可以花费更少的时间来思考晚餐吃什么。
We hope you will use this information to develop smart and simple ways to help people get the dinner and entertaining ideas they’re looking for.
我们希望您能利用这些信息来发掘聪明而简单的方法,帮助人们获得他们正在寻找的晚餐和娱乐创意。
GET ACCESS TO:
获取:
Thousands of proven family favorite recipes
成千上万的家庭喜欢的食谱
Extensive recipe filtering by key ingredients, product UPC, keywords and more
利用关键成分,产品UPC,关键字等进行配方筛选
Professional food photography
专业美食摄影
Reader-generated recipe reviews & comments
读者撰写的食谱评论
Recipe search results through superior tagging
通过高级标签搜索配方
Well-known food brands people know and trust
人们了解和信任的知名品牌
SO MANY POSSIBILITIES:
很多可能性:
Enhance websites with related recipes & delicious looking photographs
使用相关食谱和美味的照片来优化网站
Create food-related apps (for websites and the latest and greatest devices and toys) and helpful shopping and cooking tools
创建食品相关的APP(用于网站,最新最好的设备和玩具)以及有用的购物和烹饪工具
Augment social media sites like Facebook, Twitter, & Google+
增加Facebook,Twitter和Google+等社交媒体网站分享
Raise visibility for your brand
提高品牌的知名度
Drive more traffic to your site and gain new readers from a wider audience
为您的网站带来更多流量,从人群中吸引更多的读者
The sky’s the limit
没有上限
Imagine cross-referencing the people who comment on recipes with their social media accounts to target people by flavor preferences.
想象一下,参考人们在社交媒体上对于食谱的评论,通过味道偏好来定向覆盖人群。
But that’s just the tip of the iceberg.
但这只是冰山一角。
Google hosts a growing number of data sets that are directly accessible through its BigQuery utility.34
Google托管了越来越多可通过其BigQuery程序直接访问的数据集。
BigQuery is a fully managed data warehouse and analytics platform.
BigQuery是一个完全托管的数据仓库和分析平台。
The public datasets listed on this page are available for you to analyze using SQL queries.
此页面上列出的公共数据集你都可以使用SQL语句进行分析。
You can access BigQuery public data sets using the web UI, the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.
您可以使用Web UI,命令行工具或使用各种客户端库(如Java,.NET或Python)调用BigQuery REST API来访问BigQuery公共数据集。
The first terabyte of data processed per month is free, so you can start querying datasets without enabling billing.
每月1TB免费数据处理,因此您可以在不花钱的情况下查询数据集。
To get started running some sample queries, select or create a project and then run the example queries on the NOAA GSOD weather dataset.
想运行一些示例查询,请选择或创建项目,然后在NOAA GSOD天气数据集上运行示例查询。
GDELT Book Corpus
GDELT Book Corpus
A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
一个包含350万个数字化图书的数据集,可追溯到两个世纪以前的全英文公共域书籍,包括互联网档案馆(130万册)和HathiTrust数据图书馆项目(220万册)。
GitHub Data
GitHub Data
This public dataset contains GitHub activity data for more than 2.8 million open source GitHub repositories, more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files.
该公共数据集包含超过280万个开源GitHub资料库,超过1.45亿个独立提交,超过20亿个不同的文件目录,以及1.63亿个最新文件版本。
Hacker News
Hacker News
A dataset that contains all stories and comments from Hacker News since its launch in 2006.
一个数据集,其中包含自2006年推出以来Hacker News的所有故事和评论。
IRS Form 990 Data
IRS Form 990 Data
A dataset that contains financial information about nonprofit/exempt organizations in the United States, gathered by the Internal Revenue Service (IRS) using Form 990.
包含美国非盈利/豁免组织的财务信息的数据集,由美国国税局(IRS)利用990表格收集。
Medicare Data
医疗保险数据
This public dataset summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers.
该公共数据集汇集了由一些特定的住院和门诊医院,医生和其他供应商向Medicare福利机构提供的流程,服务和处方药的使用和支付信息。
Major League Baseball Data
美国职业棒球大联盟数据
This public dataset contains pitch-by-pitch activity data for Major League Baseball (MLB) in 2016.
该公共数据集包含2016年美国职业棒球大联盟(MLB)的每个活动数据。
NOAA GHCN
NOAA GHCN
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews.
该公共数据集由美国国家海洋和大气管理局(NOAA)创建,包括全球陆地站的气候摘要信息,这些气候摘要信息已经过一系列常规的质量保证审查。
This dataset draws from more than 20 sources, including some data from every year since 1763.
该数据集有超过20个来源,包括一些自1763年以来每年的数据。
NOAA GSOD
NOAA GSOD
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center.
该公共数据集由美国国家海洋和大气管理局(NOAA)创建,包括从美国空军气候中心获得的全球数据。
This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.
该数据集涵盖了1929年至2016年间从超过9000个站点收集的GSOD数据。
NYC 311 Service Requests
NYC 311服务请求
This public data includes all 311 service requests from 2010 to the present, and is updated daily.
此公共数据包括从2010年到现在的所有311个服务请求,并且每天都在更新。
311 is a non-emergency number that provides access to non-emergency municipal services.
311是非紧急需求号码,用于获取非紧急市政服务。
NYC Citi Bike Trips
NYC Citi Bike Trips
Data collected by the NYC Citi Bike bicycle sharing program that includes trip records for 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City since Citi Bike launched in September 2013.
纽约Citi Bike自行车共享计划收集的数据,包括自2013年9月计划推出以来曼哈顿,布鲁克林,皇后区和泽西市的10,000辆自行车和600个车站的旅行记录数据。
NYC TLC Trips
NYC TLC Trips
Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
纽约市出租车和豪华轿车公司(TLC)收集的数据,包括2009年至2015年在纽约市黄色和绿色出租车所有的旅行记录。
NYPD Motor Vehicle Collisions
NYPD机动车事故
This dataset includes details of Motor Vehicle Collisions in New York City provided by the Police Department (NYPD) from 2012 to the present.
该数据集包括警察局(NYPD)从2012年至今提供的纽约市机动车事故的详细信息。
Open Images Data
公开图像数据
This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.
此公共数据集包含大约9百万个图像的URL和元数据,这些图像已有超过6,000个类别的标签注释。
Stack Overflow Data
Stack Overflow Data
This public dataset contains an archive of Stack Overflow content, including posts, votes, tags, and badges.
此公共数据集包含Stack Overflow内容的存档,包括帖子,投票,标签和徽章。
USA Disease Surveillance
USA Disease Surveillance
A dataset published by the U.S. Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
由美国卫生和公众服务部发布的数据集,其中包括1888年至2013年期间发布的所有美国城市和州的所有每周监测的全国性疾病。
USA Names
USA Names
A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
社保安全管理数据集,其中包含1879年以后在美国出生的社会保障卡申请上的所有姓名。
In its top-20 list of the best free data sources available online, Data Science Central includes:35
排名前20的最佳免费网上数据源:
Data.gov.uk, the UK government’s open data portal including the British National Bibliography—metadata on all UK books and publications since 1950.
Data.gov.uk,英国政府的开放数据门户网站,包括英国国家书目 – 自1950年以来所有英国书籍和出版物的元数据。
Data.gov.
Data.gov.
Search through 194,832 USA data sets about topics ranging from education to Agriculture.
可以搜索194,832个美国数据集,内容涉及下至农业,上至教育等领域。
US Census Bureau latest population, behaviour and economic data in the USA.
美国人口普查局,美国的最新人口统计,行为和经济数据。
Socrata—software provider that works with governments to provide open data to the public, it also has its own open data network to explore.
Socrata-一家与政府合作向公众提供开放数据的软件提供商,它也有自己的开放数据网络供公众获取数据。
European Union Open Data Portal—thousands of datasets about a broad range of topics in the European Union.
欧盟开放数据门户 – 涵盖欧盟广泛主题的数千个数据集。
DBpedia, crowdsourced community trying to create a public database of all Wikipedia entries.
DBpedia,众包社区,试图创建所有维基百科条目的公共数据库。
The New York Times, a searchable archive of all New York Times articles from 1851 to today.
纽约时报 – 可搜索1851年至今的所有“纽约时报”文章。
Dataportals.org, datasets from all around the world collected in one place.
Dataportals.org,世界各地的数据集汇总。
The World Factbook, information prepared by the CIA about what seems like all of the countries of the world.
世界概况信息,由美国中央情报局编写,似乎涵盖了世界上所有国家。
NHS Health and Social Care Information Centre datasets from the UK National Health Service.
来自英国国家健康服务中心提供的NHS健康和社会关怀信息中心数据集。
Healthdata.gov, detailed USA healthcare data covering loads of health-related topics.
Healthdata.gov,详细的美国医疗数据,涵盖了大量与健康相关的数据主题。
UNICEF statistics about the situation of children and women around the world.
UNICEF儿童基金会-关于世界各地儿童和妇女状况的统计数据。
World Health organisation statistics concerning nutrition, disease and health.
世界卫生组织-有关营养,疾病和健康的统计数据。
Amazon web services’ large repository of interesting datasets including the human genome project, NASA’s database and an index of 5 billion web pages.
亚马逊网络服务的一些有趣的数据库,包括人类基因组计划,NASA数据库和50亿网页索引。
Google Public data explorer search through already mentioned and lesser known open data repositories.
Google公共数据探索去搜索已经提到的却鲜为人知的开放数据存储库。
Gapminder, a collection of datasets from the World Health Organisation and World Bank covering economic, medical and social statistics.
Gapminder,来自世界卫生组织和世界银行的数据集,涵盖经济,医学和社会统计数据。
Google Trends analyse the shift of searches throughout the years.
Google Trend可以分析了这些年来搜索的变化情况。
Google Finance, real-time finance data that goes back as far as 40 years.
Google财经,实时金融数据,可追溯到40年前。
UCI Machine Learning Repository, a collection of databases for the Machine Learning community.
UCI机器学习库,机器学习社区的数据库集合。
National Climatic Data Center, the world's largest archive of climate data.
国家气候数据中心,世界上最大的气候数据档案库。
While all of the above is far too much for humans to sift through, machines might be able to find a useful, and potentially profitable, correlation.
上述所有提到的数据库对于人类来说都太过庞大,但机器可能在上面的数据中找到有用,能带来收益的相关关系。
One Oracle blog post36 included this about Red Roof Inn:
关于Red Roof Inn酒店的一篇Oracle博客文章这样做了:
Marketers for the hotel chain took advantage of open data about weather conditions, flight cancellations and customers’ locations to offer last-minute hotel deals to stranded travelers.
酒店连锁店的营销人员利用天气状况,航班取消和客户位置的公开数据,为滞留旅客提供及时的酒店优惠券。
They used the information to develop an algorithm that considered various travel conditions to determine the opportune time to message customers about nearby hotel availability and rates.
他们利用这些信息开发了一套算法,该算法综合考虑了各种旅行因素件,来决定向客户发送附近酒店房间和费用信息的合适时机。
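The blog post does not describe the algorithm itself, so the following is only an illustrative sketch of the general idea: combine a few open-data signals into a rule that decides when a last-minute offer is timely. The field names and thresholds are hypothetical and are not Red Roof Inn's actual method.

```python
from dataclasses import dataclass

@dataclass
class TravelConditions:
    flights_cancelled: int   # cancellations at the nearby airport today
    severe_weather: bool     # e.g. a storm warning is in effect
    miles_to_hotel: float    # customer's distance from the nearest hotel

def should_message(conditions, min_cancellations=20, max_miles=10.0):
    """Return True when a last-minute hotel offer looks timely."""
    # Travelers are likely stranded if many flights were cancelled, or if
    # bad weather coincides with at least one cancellation.
    stranded_likely = (
        conditions.flights_cancelled >= min_cancellations
        or (conditions.severe_weather and conditions.flights_cancelled > 0)
    )
    # Only message customers who are close enough to use the offer.
    return stranded_likely and conditions.miles_to_hotel <= max_miles
```

A production system would presumably learn these thresholds from response data rather than hard-coding them.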
Might information on Iowa liquor sales be useful?
这些信息对爱荷华州酒类销售会有用吗?
“This dataset contains the spirits purchase information of Iowa Class ‘E’ liquor licensees by product and date of purchase from January 1, 2014, to current.
“此数据集包含爱荷华州’E’级酒牌许可人购买烈性酒的信息,按产品和购买日期收集,时间跨度是从2014年1月1日到当前。
The dataset can be used to analyze total spirits sales in Iowa of individual products at the store level.”37
该数据集可用于分析爱荷华州商店级别的烈酒销售情况。“
And don’t look now, but here comes the Internet of Things and the unbelievable amounts and types of data that will come spilling out.
不要仅仅盯着现在,物联网时代马上来临,这将会产生令人难以置信的数据类型和数量。
The same can be said for exhaust data.
同样的,对于数据的利用也会爆发
That’s information that’s a byproduct of some action, reaction, or transaction.
这些信息可能是某些行为,反应或交易的副产品。
Walking through a shopping center throws off lots of exhaust information about where you are.
穿过购物中心的行为会给出大量有关您所在位置的信息。
How often you respond to text messages, where you take pictures, and whether you speed up at yellow stop lights is reactive.
您对文字信息的响应频率,拍照的位置以及是否在黄灯时加速都是反应性的信息。
Whether stocks trade more when the market goes up or down is transaction-oriented.
当股市上涨或下跌时股票是否交易的更多就是以交易导向的信息。
There are, of course, companies that offer a conglomeration of the above as a service.
当然,有些公司提供上述数据集的整合服务。
Second Measure sells insights derived from credit card transactions so you can “spot inflections in businesses as they happen, identify this week’s fastest-growing companies [and] see the latest KPIs (Key Performance Indicators) before they’re announced.”
Second Measure提供来自信用卡交易的洞察,因此您可以“及时了解业务发生时的影响,确定本周增长最快的公司[并]在正式数据宣布之前查看最新的KPI(关键绩效指标)。”
Mattermark monitors marketplace KPIs such as companies’ net revenue, gross margin, growth, market share, liquidity, average order value, Net Promoter Score, retention, cost per customer acquisition, marketing channel mix, overall ROI, and cash burn rate.
Mattermark监控公司的KPI,例如公司的净收入,毛利率,增长情况,市场份额,流动性,平均订单价值,净推荐值,客户留存,获客成本,营销渠道组合,总体ROI和现金消耗率。
This is a whole new data set for B2B sales and competitive intelligence.
这是全新的关于B2B销售和竞争情报的数据集。
The combination of all of the available data with the power of machine learning is cause for excitement and competitive advantage.
通过机器学习的能力,将所有可用数据都能利用起来,这让人感到很兴奋,同时也会带来竞争优势。
(See Figure 1.5.)
(见图1.5。)
Data for Sale
待售数据
Upon this gifted age, in its dark hour, Rains from the sky a meteoric shower
在这个被眷顾的时代,在它的黑暗时刻,从天而降的流星雨
Of facts….
事实……
They lie unquestioned, uncombined.
它们躺在那里,无人质疑,也无人整合。
Wisdom enough to leech us of our ill
智慧让我们远离疾病
Is daily spun, but there exists no loom To weave it into fabric …
每天纺纱,但是没有织机把纱做出衣服。
Edna St. Vincent Millay from Sonnet 137, Huntsman, What Quarry?
埃德娜 圣文森特 米莱,十四行诗,Huntsman,什么采石场?
In an ideal world, the machine collects all the data there is and weaves it into a tapestry that makes all things clear at a glance.
理想世界中,机器收集所有数据并将其编织成一个让所有事物都一目了然的挂毯。
The data aggregation industry has been active for years, starting with the Census Bureau 115 years ago.
从115年前的人口普查局开始,数据聚合行业已经活跃很多年了。
Since then, it’s become a big business.
从那时起,它就成了一桩大生意。
The amount of available information is enormous, from public records and criminal databases to credit rating firms and credit card companies to public companies like Dun & Bradstreet and Acxiom, which claims to have more than 32 billion records.
可用的数据量是巨大的,从公共记录,犯罪数据库到信用评级公司,信用卡公司,以Dun&Bradstreet和Acxiom这样的上市公司,他们声称拥有超过320亿条记录,可用信息量巨大。
That’s the sort of aggregator that powers most direct mail and telemarketers.
这些数据聚合起来为大多数邮件营销和电话推销提供支持。
Acxiom’s extensive third-party data offers rich insight into consumers and their behaviors:
Acxiom拥有的广泛的第三方数据为他们提供了消费者及其行为的丰富洞察:
Curated from multiple, reliable sources
从多个可靠的来源获取
Includes more than 1,000 customer traits and basic information, including location, age, and household details
包括1,000多个客户特征和基本信息,有位置,年龄和家庭详细信息。
[Figure: a chart arranging data sources by Variety (structured, semi-structured, and unstructured data) against Volume and Velocity, grouped into core transactional data, internal systems data, and other data. Sources shown range from ERP and CRM transaction data, bar code systems, EDI invoices and purchase orders, and loyalty programs to call center logs, web logs, RFID, GPS-enabled telematics, mobile location, weather data, traffic density, Twitter feeds, Facebook status, and blogs and news.]
Figure 1.5 So many types of data, so little time38
图1.5 短时间获得这么多类型的数据
Provides more than 3,500 specific behavioral insights, such as propensity to make a purchase
提供超过3,500种特定的行为洞察,例如购买倾向
Offers real insights into a broad spectrum of offline behavior, not just indicators from web browsing behavior
提供真实的详细的线下行为洞察,而不仅仅网络浏览行为指标
Gives analysts more ways to segment data and use for audience modeling
为分析师提供了更多的方法进行数据细分和受众建模。
Acxiom data fuels highly personalized data-driven campaigns, enabling you to:
Acxiom数据可以为数据驱动的高度个性化的推广活动提供支持,使您能够:
Personalize messages and consistently engage audiences across all channels
推送个性化消息并持续的跨渠道吸引受众。
Incorporate both online and offline data in a safe, privacy-compliant way
兼顾安全,隐私的方式将在线和离线数据整合在一起
Segment audiences at the household or individual level based on a variety of options from ethnicity and acculturation to digital behaviors
根据多个变量(族群,数字媒体的接受程度)在家庭或者个人层面上进行人群细分。
Optimize for scale and accuracy
优化规模和准确性
Request audience recommendations from seasoned data experts39
向经验丰富的数据专家咨询受众方面的建议
But Wait—There’s More
但等等 – 还有更多
The volume and variety of data seems to have no end.
这还有很多的数据。
The weather (http://www.ncdc.noaa.gov/)
天气(http://www.ncdc.noaa.gov/)
U.S. Census data (http://dataferrett.census.gov/)
美国人口普查数据(http://dataferrett.census.gov/)
Japan Census Data (https://aws.amazon.com/datasets/Economics/2285)
日本人口普查数据(https://aws.amazon.com/datasets/Economics/2285)
Health and retirement study (http://www.rand.org/labor/aging/dataprod/hrs-data.html)
健康和退休研究(http://www.rand.org/labor/aging/dataprod/hrs-data.html)
Federal Reserve economic data (https://aws.amazon.com/datasets/Economics/2443)
美联储经济数据(https://aws.amazon.com/datasets/Economics/2443)
The entire Internet for the past seven years (http://commoncrawl.org/)
过去七年的整个互联网数据(http://commoncrawl.org/)
125 years of public health data (http://www.bigdatanews.com/group/bdn-daily-press-releases/forum/topics/pitt-unlocks-125-years-of-public-health-data-to-help-fight-contag)
125年的公共卫生数据(http://www.bigdatanews.com/group/bdn-daily-press-releases/forum/topics/pitt-unlocks-125-years-of-public-health-data-to-help-fight-contag)
Consumer complaints about financial products and services (http://catalog.data.gov/dataset/consumer-complaint-database)
消费者对金融产品和服务的投诉(http://catalog.data.gov/dataset/consumer-complaint-database)
Product safety recalls from the Consumer Product Safety Commission (http://www.cpsc.gov/Newsroom/News-Releases/2010/CPSC-Makes-Recall-Data-Available-Electronically-to-Businesses-3rd-Party-Developers/)
消费品安全委员会的产品安全召回(http://www.cpsc.gov/Newsroom/News-Releases/2010/CPSC-Makes-Recall-Data-Available-Electronically-to-Businesses-3rd-Party-Developers/)
Franchise failures by brand (https://opendata.socrata.com/Business/Franchise-Failureby-Brand2011/5qh7-7usu)
品牌特许经营失败案例(https://opendata.socrata.com/Business/Franchise-Failureby-Brand2011/5qh7-7usu)
Top 30 earning websites (https://opendata.socrata.com/Business/Top-30-earning-websites/rwft-hd5j)
收入前30名网站(https://opendata.socrata.com/Business/Top-30-earning-websites/rwft-hd5j)
Car sales data (https://opendata.socrata.com/Business/Car-Sales-Data/da8m-smts)
汽车销售数据(https://opendata.socrata.com/Business/Car-Sales-Data/da8m-smts)
Yahoo! Search Marketing Advertiser bidding data (http://webscope.sandbox.yahoo.com/catalog.php?datatype=a)
雅虎搜索营销广告客户出价数据(http://webscope.sandbox.yahoo.com/catalog.php?datatype=a)
American time use survey (http://www.bls.gov/tus/tables.htm)
美国人时间花费调查(http://www.bls.gov/tus/tables.htm)
Global entrepreneurship monitor (http://www.gemconsortium.org/Data)
全球企业家监督(http://www.gemconsortium.org/Data)
Wage Statistics for the U.S. (http://www.bls.gov/bls/blswage.htm)
美国的工资统计(http://www.bls.gov/bls/blswage.htm)
City of Chicago building permits from 2006 to the present (https://data.cityofchicago.org/Buildings/Building-Permits/ydr8-5enu)
从2006年到现在芝加哥市建筑许可证(https://data.cityofchicago.org/Buildings/Building-Permits/ydr8-5enu)
Age, race, income, commute time to work, home value, veteran status (http://catalog.data.gov/dataset/american-community-survey)
年龄,种族,收入,通勤上班时间,房屋价值,退伍军人现状(http://catalog.data.gov/dataset/american-community-survey)
Or how about all of Wikipedia? (http://en.wikipedia.org/wiki/Wikipedia:Database_download)
或者维基百科的全部内容如何?(http://en.wikipedia.org/wiki/Wikipedia:Database_download)
A Collaboration of Datasets
数据集的协作
After three years as a systems analyst at Deloitte, Brett Hurt started one of the first web analytics companies (Coremetrics, later sold to IBM) and an online reviews and ratings company (Bazaarvoice), and has now turned his attention to the world of data.
在担任Deloitte的系统分析师三年后,Brett Hurt成立了第一家网络分析公司(Coremetrics,后来出售给IBM),一家在线评论和评级公司(Bazaarvoice),他将自己的注意力转向了数据世界。
His current startup is data.world, a B-Corp (Public Benefit Corporation) intent on building a collaborative data resource.
他目前的创业公司是data.world,一家致力于构建协作数据资源的公益公司(Public Benefit Corporation)。
From the outset, according to John Battelle,40 “Hurt & co. may well have unleashed a blast of magic into the world.”
根据John Battelle的说法,从一开始,“Hurt&co。很可能已经向世界释放了这样的魔力。”
The problem they are out to solve is allowing data to be visible.
他们要解决的问题是要让数据可见。
Rather than data shoved into its own database silo, hidden away from all other data, as we experience it now, data.world seeks to unlock that data and make it discoverable, just as the World Wide Web has brought links between research papers and marketing materials and blog posts.
正如我们现在所经历的:数据被装进自己的数据库孤岛,与所有其他数据隔离,而data.world寻求解锁那些数据并使其可被其他人发现,就像万维网做的那样,将在研究论文、营销素材、博客文章连在了一起。
One consistently formatted master repository, with social and sharing built in. Once researchers upload their data, they can annotate it, write scripts to manipulate it, combine it with other data sets, and most importantly, they can share it (they can also have private data sets).
一个通用格式的主存储库,内置社交和共享功能。一旦研究人员上传他们的数据,他们可以对其进行注释,编写脚本来操作它,将其与其他数据集合并,最重要的是,他们可以共享这些数据(他们也可以是私人数据集)。
Cognizant of the social capital which drives sites like GitHub, LinkedIn, and Quora, data.world has profiles, ratings, and other “social proofs” that encourage researchers to share and add value to each others’ work.
Cognizant这样的社会资本推动了GitHub,LinkedIn和Quora的发展,Data.world拥有专业知识,良好等级和其他“社会证据”,可以鼓励研究人员主动分享并增加彼此工作的价值。
In short, data.world makes data discoverable, interoperable, and social.
简而言之,data.world使数据可以被发现,可操作和可流通。
And that could mean an explosion of data-driven insights is at hand.
这可能意味着数据驱动洞察的时代即将到来。
For artificial intelligence to really flex its muscles, it must have a lot of data to chew on; data.world feels like a step in the right direction to join up the massive amounts of data that’s out there, for the use of all comers.
对于能真正发挥作用的人工智能来说,它必须有大量的数据可供吸收; data.world似乎朝着正确的方向迈出了一步,链接了大量的数据,供所有人使用。
A Customer Data Taxonomy
客户数据分类
The breadth of available data is overwhelming (social media graphs, Facebook Likes, tweets, auto registration, voting records, etc.).
可以利用的数据的种类是非常多的(社交媒体图,Facebook赞,推文,自动注册,投票记录等)。
It’s helpful to have a taxonomy at hand.
有一个数据的分类很有帮助。
Types of Collectible Information
收集的信息的类型
The wide variety of data is expanding at a phenomenal rate.
各种各样的数据正在以惊人的速度产生。
Here is an indicative but not exhaustive list of data sets shoved into categorization cubbyholes through sheer blunt force.
这有一个有指导性但不是最详尽的数据集列表,比较粗暴的分类。
Identity
身份
Can we identify them?
我们可以识别它们吗?
Who are they?
他们是谁?
Name
姓名
Gender
性别
Age
年龄
Race
种族
Address
地址
Phone
电话
Fingerprint
指纹
Heart rate
心率
Weight
重量
Device
设备
Government ID
政府ID
And so on
等等
History
历史
What’s in their past?
他们有什么样的过去?
What have they done or achieved?
他们做了什么或取得了什么?
Education
教育
Career
职业
Criminal record
犯罪记录
Press exposure
媒体曝光
Publications
出版物
Awards
奖励
Association memberships
协会会员
Credit score
信用分数
Legal matters
法律事务
Loans
贷款
Divorce
离婚
Where they have traveled
他们去过的地方
And so on
等等
Proclivities
倾向
What attracts them?
什么吸引他们?
Are they liberal or conservative?
他们自由还是保守?
What do they like?
他们喜欢什么?
Preferences
偏好
Settings
设置
Avocations
业余爱好
Political party
政党
Social groups
社会团体
Social “Likes”
社交媒体“赞”
Entertainment
娱乐
Hobbies
爱好
News feeds
新闻提要
Browser history
浏览历史
Brand affinity
品牌偏好
And so on
等等
Possessions
财产
What do they have, whether purchased, acquired, found, or made?
他们有什么,无论通过购买,获取,发现还是制造拥有的?
Income
收入
Home
家庭
Cars
汽车
Devices
设备
Clothing
衣服
Jewelry
珠宝
Investments
投资
Subscriptions
订阅
Memberships
会员
Collections
收集
Relationships
关系
And so on
等等
Activities
活动
Can we catch them in the act?
我们可以在行动中找到他们吗?
What do they do and how do they do it?
他们做了什么,他们是如何做到的?
Keystrokes
击键
Gestures
手势
Eye tracking
眼动追踪
Day part
白天部分
Location
位置
IP address
IP地址
Social posts
社交媒体发文
Dining out
外出就餐
Television viewing
看电视
Heart rate over time
心率
And so on
等等
Beliefs
信仰
How do they feel and where do they stand on issues?
他们感觉如何以及他们在问题上的立场?
Religion
宗教
Values
价值观
Donations
捐赠
Political party
政党
Skepticism/Altruism
怀疑论/利他主义
Introvert/Extrovert
内向/外向
Generous/Miserly
大方/吝啬
Adaptive/Inflexible
自适应/灵活的
Aggressive/Passive
主动/被动
Opinion
观点
Mood
心情
And so on
等等
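The taxonomy above can be held in a simple lookup structure. The sketch below is one possible encoding, abridged to a few attributes per category; the function name is hypothetical. It returns a list because some attributes, such as political party, appear under more than one heading.

```python
# Categories and sample attributes copied from the list above (abridged).
TAXONOMY = {
    "Identity": ["Name", "Gender", "Age", "Address", "Phone", "Government ID"],
    "History": ["Education", "Career", "Credit score", "Loans"],
    "Proclivities": ["Preferences", "Political party", "Hobbies", "Brand affinity"],
    "Possessions": ["Income", "Home", "Cars", "Subscriptions"],
    "Activities": ["Keystrokes", "Location", "IP address", "Social posts"],
    "Beliefs": ["Religion", "Values", "Donations", "Political party", "Mood"],
}

def categories_for(attribute):
    """Return every category that lists the attribute (case-insensitive)."""
    wanted = attribute.strip().lower()
    return [
        category
        for category, attributes in TAXONOMY.items()
        if wanted in (a.lower() for a in attributes)
    ]
```

An attribute that is not listed yields an empty list, a reminder that the categories are indicative rather than exhaustive.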
Methods of Data Capture
数据获取的方法
All of the above comes to light in a variety of ways.
上面提到的所有数据可以通过很多方法获取。
As time goes on and legislation crops up, data scientists will bear more responsibility for knowing whether an individual data element was collected with full consent.
随着时间的推移和法律的完善,数据科学家有更大的责任保证个人数据的获取是否经过明确的同意。
The future will also require recording whether that consent was given in perpetuity or only for the purpose initially stated.
未来还将要求记录该同意是永久性的同意,还是仅适用于最初提出的用途的时候。
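One way to record this alongside each data element is sketched below. The class and field names are hypothetical, and the sketch makes a simplifying assumption that consent given in perpetuity covers any later use, while purpose-limited consent covers only the purposes stated at collection time.

```python
from dataclasses import dataclass, field
from enum import Enum

class ConsentScope(Enum):
    NONE = "none"               # collected without full consent
    STATED_PURPOSE = "purpose"  # only for the purpose initially stated
    PERPETUAL = "perpetual"     # given in perpetuity

@dataclass
class DataElement:
    name: str
    value: object
    scope: ConsentScope
    stated_purposes: set = field(default_factory=set)

def may_use(element, purpose):
    """Check whether the recorded consent covers the proposed use."""
    if element.scope is ConsentScope.PERPETUAL:
        return True
    if element.scope is ConsentScope.STATED_PURPOSE:
        return purpose in element.stated_purposes
    return False
```

For example, an email address collected with `ConsentScope.STATED_PURPOSE` for `{"warranty"}` would pass `may_use(element, "warranty")` but fail `may_use(element, "advertising")`.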
Here, then, are suggested categories of data capture, based on “The Origins of Personal Data and Its Implications for Governance” by Martin Abrams,41 which included a taxonomy based on origin.
因此,根据Martin Abrams撰写的“个人数据的源头及其对管理的影响”,建议使用这样的数据采集类别,其中包括基于起源的数据分类法。
Provided
提供
Individuals are highly aware when they are providing information.
个人在提供信息时非常清楚。
They might initiate the delivery of the information when filling out an application, registering to vote or registering a product for warranty, or acquiring a public license to drive, marry, or carry a gun.
他们可能会在填写申请表,注册投票,注册产品进行保修,获得驾驶证,结婚,携带枪支的公共许可证时提供这些信息。
The transactional provision of data happens any time people use a credit card.
在人们任何时候使用信用卡的时候,数据的交易条款生效。
They are clearly and knowingly identifying themselves.
他们清楚的知道这将关联到他们自己。
Paying a bill by writing a check qualifies as well, as does answering surveys, registering for a school, or participating in a court proceeding.
还有通过支票付款,回答调查问卷,学校注册或者参加法庭也一样。
This would also pertain to filling in one of those online quizzes (Which Star Wars Character are You?).
这也适用于填写其中一个在线测验的时候(比如,哪个星球大战角色是你?)。
Individuals are also said to be providing information when they post it publicly.
个人在公开发布信息时也会提供信息。
That may be delivering a speech in public, writing a letter to the editor for publication, or posting something online in a social network.
这可能是在公共场合发表演讲,写信给编辑寻求出版,或在社交网络上发布内容。
Posting happens when you announce to all of your Facebook friends that you are, indeed, Han Solo.
当您发帖,向所有Facebook好友宣布您确实是Han Solo时。
Observed
观察到的
Information can be casually observed.
信息也可能被偶然的观察到。
The Internet is an ideal place for observation as every click is recorded.
互联网是观察的理想场所,因为每个点击都被记录了。
People forget that their phone is always listening to them in case they wish to summon “Hey Siri” or “OK Google” by voice.
人们忘记了他们的电话总是在倾听他们,以便让他们可以通过语音唤醒电话“hey siri”或者“OK Google”。
Browser cookies and loyalty cards are examples of engaged observations.
浏览器cookie和会员卡是参与到观察中的例子。
People go to a website intentionally.
人们特意去一个网站。
They have their grocery store card scanned on purpose.
他们特意去刷他们的杂货店卡。
They know they’re doing it, but they’re not thinking in terms of that action being revealing.
他们知道他们在做什么,但他们并没有考虑这个行动是否有别的影响。
They may choose to refuse to use their membership card or surf incognito, but they trade off convenience and discounts.
他们可以选择拒绝使用会员卡或隐姓埋名上网,但他们会损失掉折扣和便利性。
An unanticipated collection of data surprises people for an instant, and then they realize that they knew there were sensors and conclude they probably knew data was being gathered.
没意外的数据收集方法瞬间令人惊讶,然后他们意识到他们知道那里有传感器,并得出结论:他们可能知道数据正在被收集。
You know your car can talk to the cloud to get navigation map updates and to call for roadside assistance.
您知道您的汽车可以与云通信以更新导航地图并呼叫路边援助。
But you might not have read the manual where it talks about collecting information on engine temperature and tire pressure as well.
但是,您可能还没有阅读相关的手册,手册里里还有收集发动机温度和轮胎压力的信息。
The passive collection of data is where things start to border on creepy.
被动的数据收集是事情开始失去控制的地方。
People don’t expect to have their picture taken by a traffic camera and then dropped into a database.
人们不希望通过自己被交通相机拍到,照片然后被放入数据库。
They don’t expect their movements to be recorded as they walk around a department store.
当他们路过百货商店时,他们不希望自己的动作被记录下来。
There is no expectation of privacy, but the first time you become aware that it’s happening, you feel a little queasy.
没有隐私可言,但是当你第一次意识到这种事情正在发生时,你会感到有些不安。
After that, it becomes the new normal.
然后,这成为了新常态。
Derived
衍生
Now that the raw material has been scooped up, it’s time to start massaging it.
既然原材料已经获得了,那么现在是时候开始加工它了。
The amount of time you spend on one page or another is computationally derived.
您在一个页面或另一个页面上花费的时间是通过计算得出的。
We subtract the time you arrive from the time you leave, and voilà, time-on-page.
我们会用您离开的时间减去您到达的时间,瞧,这样就得到了你在页面停留时间。
This information must be calculated.
这些信息必须通过计算得到。
How often do you search for gaming laptops?
您多久搜索一次游戏笔记本电脑?
How much do you usually buy on this site?
你一般在这个网站上花多少钱?
How often do you return?
你多久返回一次?
The result of each of these calculations is another data point that can be associated with an individual, but there’s no way for that person to know such provided and observed data is being manipulated.
这些计算中的每一个的结果都是可以与个人相关联的一个数据点,但是你无法知道这些提供的和观察到的数据后面怎么继续加工。
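As a rough sketch of how such data points might be derived, the snippet below rolls up a visitor's observed sessions into three computed values. The session records, the field names, and the reading of "return" as a repeat visit are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical observed sessions for one visitor (all values invented).
sessions = [
    {"searches": ["gaming laptop"], "purchases": [1200.0]},
    {"searches": ["gaming laptop", "usb cable"], "purchases": []},
    {"searches": [], "purchases": [80.0]},
]


def derive_profile(sessions):
    """Compute derived data points from observed session records."""
    searches = [q for s in sessions for q in s["searches"]]
    purchases = [p for s in sessions for p in s["purchases"]]
    return {
        # How often do you search for gaming laptops?
        "gaming_laptop_searches": searches.count("gaming laptop"),
        # How much do you usually buy on this site?
        "average_purchase": mean(purchases) if purchases else 0.0,
        # How often do you return? (assumed here to mean repeat visits)
        "return_visits": max(len(sessions) - 1, 0),
    }


profile = derive_profile(sessions)
print(profile)
```

Each value in `profile` is a new data point attached to the visitor, even though the visitor never supplied it directly.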
Data about you can be notionally derived by assigning you to a given category like lookie-loo versus serious buyer or soccer-mom versus single mother.
关于你的数据也可以通过概念上的归类衍生出来,即把你划入某个类别,比如随便看看的人与认真的买家,或者足球妈妈与单亲妈妈。
This sort of classification is also invisible to the individual being labeled.
这种分类是被标记的个体不知道的。
Play your cards right and that merchant may decide you are a prime candidate for a super-discount-member category.
你很会过日子,那么商家可能会认为你是超级折扣会员类别的主要候选人。
Finding yourself misclassified can be surprising, annoying, or cause for arrest in the wrong database.
发现自己被错误归类可能令人惊讶、令人恼火,如果错误出现在某些数据库里,甚至可能导致被捕。
Inferred
推断
Data that is created through inference has taken computational data a step further into analytical evaluation.
通过推理创建的数据促使我们把可计算的数据带到了分析评估领域。
Statistically inferred data determines whether you get a call while on vacation asking if that’s really you checking into that hotel.
统计推断的数据决定了您在度假时是否会接到电话,被询问您是否真的入住了该酒店。
Your FICO score is the statistical result of comparing you to others.
您的FICO分数是您与其他人进行比较后得到的统计结果。
Take statistics to their logical extreme and you have advanced analytical data.
把统计推到逻辑上的极致,你就得到了高级分析数据。
Big data and AI are hard at work to correlate all of the above to come to a supposition about the prospect or customer.
大数据和人工智能正在努力将上述所有内容联系起来,以提出关于潜在客户或现有客户的假设。
How likely are you to be who you say you are?
你就是你自己说的那样的可能性有多大?
How likely are you to default on a loan?
您贷款违约的可能性有多大?
Contract a disease?
患上某种疾病的可能性呢?
Recommend this book to your friends?
推荐这本书给你的朋友?
The result of each of these data collection and derivation methods is—more data.
从这些数据收集和派生方法中的每一个的结果都产生了更多的数据。
Martin Abrams posits that data supplied by individuals will remain about the same as you only need a finite number of driving or wedding licenses, even while uploading photos becomes more popular.
Martin Abrams认为,个人提供的数据量将大致保持不变,因为您只需要数量有限的驾驶证或结婚证,即使上传照片变得越来越流行。
However, observed data will enjoy healthy growth as more sensors are born into the Internet of Things.
然而,随着越来越多的传感器应用在物联网中,观察到的数据将会稳步的增长。
Abrams sees derived data losing ground as inferred data becomes more popular.
随着推断数据变得越来越普遍,艾布拉姆斯认为衍生数据会失去市场。
That brings us back to AI and machine learning.
这把我们带回到人工智能和机器学习上。
“Inferred data will accelerate as more and more organizations, both public and private, increasingly take advantage of broader data sets, more computing power, and better mathematical processes,” says Abrams.
“随着越来越多的公共和私人组织越来越多地利用更广泛的数据集,更强大的计算能力和更好的数学流程,推断数据的量将会加速增长,”艾布拉姆斯说。
“The bottom-line is that data begets more data.”
“最重要的是,数据会产生更多数据。”
Marketing Data Trustworthiness
营销数据可信度
Data is a wonderful thing—especially digital data because it’s binary.
数据是一件很棒的事情 – 尤其是数字数据,因为它是二进制的。
It’s either ones or zeros and crystal clear.
它要么是1要么是0,非常清晰的。
While we’d all like to believe that’s true, only those who don’t know data at all would fall for that.
虽然我们都愿意相信这是真的,但只有那些根本不了解数据的人才会这样认为。
So Much Data, So Little Trust
数据如此多,信任如此少
One of the more difficult aspects of marketing data is its uneven fidelity.
营销数据更难以理解的一个方面是它的参差不齐。
Transactions are dependable.
交易是可靠的。
A sale was made at a given time to a given person at a given price: all rather solid.
在特定的时间以特定的价格向特定的人达成了一笔交易,一切都相当确定。
On the far end of the spectrum, social media sentiment is almost guesswork.
另一端,社交媒体的情绪分析几乎是猜测。
The conundrum comes when marketing professionals are asked to rank the relative reliability of various data sets; their minimal knowledge of the data stands in their way.
这时候我们遇到了一个难题:营销专业人员被要求对各种数据进行可靠的排名时,他们对于数据的不了解阻碍了他们这么做。
When multiple metrics are combined to form an index, the variable trustworthiness of the variables is completely hidden.
当多个指标被组合成一个指数时,各个变量参差不齐的可信度就被完全掩盖了。
The solution to this dilemma lies in data scientists working closely with marketers to properly weight the variety of data elements that go into the soup.
这种困境的解决方案在于数据科学家与营销人员的密切合作,给各种元素适当的加权。
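One way to picture the weighting described above is a weighted average, where each metric's weight reflects how much the team trusts its source. The metric names, values, and weights below are invented for illustration, and the metrics are assumed to be scaled to the range 0 to 1 beforehand.

```python
def weighted_index(metrics, weights):
    """Weighted average of metric values, each already scaled to 0..1."""
    total_weight = sum(weights[name] for name in metrics)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


# Hypothetical metric values and trust weights (assumptions, not measurements).
metrics = {"transactions": 0.80, "social_sentiment": 0.20}
weights = {"transactions": 0.9, "social_sentiment": 0.1}

index = weighted_index(metrics, weights)
unweighted = sum(metrics.values()) / len(metrics)

print(round(index, 2), round(unweighted, 2))
```

With equal weights the shaky sentiment score drags the index down as much as the dependable transaction data lifts it; with trust-based weights the index leans toward the data that deserves it.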
So Much Data, So Little Connection
如此多的数据,如此少的连接
Matthew Tod of D4t4 Solutions Plc tells a story of trying to fit online data to offline data that starts after the struggle to line up the two is over.42
D4t4 Solutions Plc的Matthew Tod讲述了一个将线上数据与线下数据相匹配的故事,故事从两者艰难对齐之后开始。
I was working with a retailer with standard, online behavioral data from a tag-based, log file system tracking sessions.
我正在与零售商合作,处理标准的线上行为数据,这些数据来自标签化的日志文件系统(用于记录会话)。
Fortunately for me, they had linked them to email addresses so I have a key to join sessions to email addresses.
幸运的是,他们已将数据与电子邮件地址关联,因此我有一个主键可以将会话数据加入到电子邮件地址数据中。
They started issuing e-receipts.
他们开始使用电子收据。
You go into the store, you buy the stuff, they email you a receipt.
你进入商店,你买东西,他们给你发送电子收据。
But only about 35% of their in-store transactions warrant a receipt, any receipt, but that 35% of transaction accounts for 90% of sales revenue because people only want a receipt for insurance purposes, or for returning a product, so for the valuable stuff.
但只有约35%的店内交易需要收据,任何收据都行,但这35%的交易却占销售收入的90%,因为人们想要收据是为了保险或者以后退回产品,所以要收据的都是买贵重东西的交易。
For the little stuff, nobody is going to ask for an e-receipt.
买些小东西的时候,没有人会要求电子收据。
So, I end up with a data set roughly 80 million sessions on the website, a million email addresses and 55 million rows of transactional data.
因此,我最终在网站上收集了大约8000万个会话的数据集,一百万个电子邮件地址和5500万行交易数据。
I bring all of that together in order to answer the question, “What is the impact of Google on my physical store sales?”
我将所有这些数据汇总在一起,以回答这个问题“Google对我的实体店销售有何影响?”。
Because I now have a link from store sale to session, and via campaign back to Google with 300,000 people, I could say, matthew.tod@gmail.com went into our Wimbledon store on Saturday.
因为我现在可以把门店销售关联到网站会话,并通过营销活动追溯到Google,涉及30万人,所以我可以说,matthew.tod@gmail.com周六去了我们的温布尔登门店。
Funny enough, I noticed he was on our website on Thursday for 45 minutes, researching products.
有趣的是,我注意到他周四在我们的网站上停留了45分钟,研究产品。
Obviously, my digital analytics regard that as an abandoned basket—fail—low conversion rate and my in-store manager goes, “Gosh, he is a great guy, he came in and spent five hundred quid!
显然,我的数字分析师把他作为一个失败的案例:废弃的购物车- 失败 – 低转换率,而我的店内经理说,“天哪,他是一个厉害的家伙,他进店后花了五百英镑!
Love him to bits!”
太喜欢他了!”
We could show, in this particular instance, that for every Pound of sales the website thought they made, we could see two Pounds in-store.
在这个具体的例子中,我们可以证明,网站认为自己每达成一英镑的销售额,我们就能在门店看到两英镑。
That was the end of the official project with the start of the science project.
官方项目到此结束,科学项目由此开始。
That’s when we started playing with machine learning.
现在是我们开始进入机器学习的时候了。
Even with the most reliable data, getting it all to make sense is still troubling.
即使拥有最可靠的数据,让一切变得有意义仍然是会令人不安。
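The kind of join Tod describes, using the email address as the key between web sessions and in-store e-receipts, might look roughly like the sketch below. Every record, name, and amount is invented, and the data is deliberately built so that the result echoes the two-to-one ratio from the story; it is not his data or his method.

```python
# Hypothetical lookup tables (all values invented for illustration).
session_email = {"s1": "a@example.com", "s2": "b@example.com", "s3": "a@example.com"}
online_sales = {"s1": 0.0, "s2": 100.0, "s3": 50.0}  # revenue per web session
store_receipts = [  # (email on the e-receipt, amount spent in store)
    ("a@example.com", 250.0),
    ("b@example.com", 50.0),
    ("c@example.com", 40.0),  # never seen online, so it cannot be joined
]


def revenue_by_channel(session_email, online_sales, store_receipts):
    """Total online and in-store revenue for customers identifiable in both."""
    known_emails = set(session_email.values())
    online_total = sum(
        amount for sid, amount in online_sales.items() if sid in session_email
    )
    store_total = sum(
        amount for email, amount in store_receipts if email in known_emails
    )
    return online_total, store_total


online_total, store_total = revenue_by_channel(session_email, online_sales, store_receipts)
print(online_total, store_total, store_total / online_total)  # 150.0 300.0 2.0
```

Note that session `s1` looks like an abandoned basket online, yet the same email address accounts for the largest in-store purchase, which is the blind spot the story illustrates.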
ARE WE REALLY CALCULABLE?
我们真的是可计算的吗?
While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty.
虽然单个的人是一个无法解开的谜,但就总体而言,他会成为数学上的必然。
Sherlock Holmes, The Sign of Four
夏洛克·福尔摩斯,《四签名》
On the BBC show Sherlock, Mary asked how Sherlock Holmes had managed to find her and the flash drive she was carrying around when, “Every movement I made was entirely random, every new personality just on the roll of a dice!”
在BBC节目Sherlock中,玛丽询问夏洛克福尔摩斯是如何找到她以及她随身携带的闪存时,“我所做的每一个动作都是随机的,每一个新的决定都只是在掷骰子!”
Sherlock replied:
Sherlock回答说:
Mary, no human action is ever truly random.
玛丽,没有任何人类行为是真正随机的。
An advanced grasp of the mathematics of probability, mapped onto a thorough apprehension of human psychology and the known dispositions of any given individual, can reduce the number of variables considerably.
对概率数学的深入掌握,结合对人类心理以及特定个体已知性情的透彻理解,就可以大大减少变量的数量。
I myself know of at least 58 techniques to refine a seemingly infinite array of randomly generated possibilities down to the smallest number of feasible variables.
我自己知道就至少有58种技术可以将看似不确定的随机组合缩减到最小数量的可用变量。
After a brief pause, he admitted, “But they’re really difficult, so instead I just stuck a tracer on the inside of the memory stick.”43
在短暂的停顿之后,他承认:“但这些方法真的很难,所以我只是在U盘里面贴了一个追踪器。”
This, then, is our task: to use the big data and machine learning tools we have at hand to see if we can’t build a better, more useful model of individual, human probabilities in order to send the right message to the right person at the right time on the right device.
这就是我们的任务:使用我们手边的大数据和机器学习工具,看看我们是否可以建立一个更好,更有用的个人/人类概率模型,以便在正确的时间正确的设备上向正确的人传达正确的信息。
Sherlock is right; it is difficult.
Sherlock是对的;这很难。
So, now you understand the idea of machine learning.
所以,现在你了解机器学习的概念。
You know just enough to hold your own at a cocktail party.
你知道的足以让你在鸡尾酒会上展现个性了。
You can nod knowingly should the topic pop up and can comfortably converse with senior management about the possibilities.
如果相关话题突然出现,您可以会意地点点头,并能与高级管理人员就各种可能性轻松交谈。
The next chapter is intended to go one level deeper.
下一章旨在更深入的探索。
You will not become a data scientist by careful study of Chapter 2, but you will be able to hold your own at a meeting on machine learning.
通过仔细研究第2章,您不会成为数据科学家,但您将能够去参加一个机器学习会议了。
You can nod knowingly should the subject matter get deeper and will be able to comfortably converse with data scientists about the possibilities.
当话题更加深入时,您可以会意地点点头,并且能够与数据科学家就各种可能性进行轻松的交谈。
CHAPTER 2