But What Is a Model?

The term model gets thrown around a lot. The word is so ubiquitous that it has nearly lost its meaning. The Wikipedia page alone shows how varied its usage is, spanning statistics, astronomy, biology, product design, and art, as well as conceptual models.

The etymology of model is interesting as well, stemming through French and Italian back to the Latin modus, for ‘measure, rhythm, or way’.

Nevertheless, the definition of ‘conceptual model’ captures the broadest sense of the word. Again from Wikipedia:

A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate a subject the model represents.

How does this relate to data science?

In Python, data scientists often use packages such as scikit-learn or statsmodels to run linear regressions, clustering algorithms, random forests, or neural nets on a variety of data for the sake of classification or prediction.

Meanwhile, the ancient Greeks and Romans used a geocentric model of the solar system to make sense of their universe. This cosmological model dominated the understanding of the universe until December 1610, when Galileo inferred that the phases of Venus ruled out the geocentric, or Ptolemaic, model, lending strong support to a sun-centered one.

These usages of the word model may seem unrelated, but returning to our definition above, they are in fact the same application of the word. Why? Our tangible interaction with Python models consists of Jupyter notebooks, that little * next to a cell while the model is built (while praying for no errors), and pickling those cherished prime-F1-score models for use on AWS or Heroku. But what is actually going on under the hood in something like scikit-learn? Why call it a model, apart from tradition? What does a pickled random forest model have to do with a little toy model of the solar system with metal spheres, or a styrofoam Bohr model of an atom?

Because these models are all intended to be representations of reality. That is what unites them. This is a subtle and difficult point to make, but it has more significance than may appear at first glance.

Models, Data Science, and Wittgenstein

The Tractatus Logico-Philosophicus, by the philosopher Ludwig Wittgenstein, is often considered one of the most important works of the twentieth century, a treatise which unites logic, science, and philosophy and culminates in a mystical refutation of itself and, to an extent, philosophy. It is a beautiful work, laid out in seven overarching propositions, which essentially state:

The world consists of ‘states of affairs’, or facts about the world. We develop representations of these facts in the form of logical propositions. We can never explicitly say what a fact and a representation of that fact have in common; we can only show it. This is an essentially mystical aspect of human life, and it makes philosophizing, including this whole work (the Tractatus), a waste of time, except that it liberates you from the urge to continue philosophizing.

For example, when we correctly recognize a mother’s face in the face of a child, what exactly are we seeing ‘in common’? We can’t quite put our finger on it: we see a family resemblance, and that’s it. We see something that we can’t say but nevertheless know to be true, and any attempt to articulate that commonality fails miserably.

Even if we try to reduce this activity to a neurological state, a neuron-based type of facial recognition and feature firing (currently suspected to be the function of the fusiform face area of the temporal lobe), we don’t actually use that reductionist approach in the act of recognition itself: we simply do it. That metaphysical distinction pervades all human activity and is the source, according to Wittgenstein, of our ethical and religious truths, which must stay apart from the realm of scientistic [sic] fact.

Finally, getting back to models (and the point): models are just such representations! When using scikit-learn to make something as simple as a linear regression model, we are using the computational ability of computers to simulate and codify this process of human-brain-driven isomorphism recognition. The coefficients of a regression model express a belief about how strongly each feature influences the target variable, usually for the purpose of prediction.
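To make the point about coefficients concrete, here is a minimal sketch of a one-feature linear regression. It uses only the Python standard library rather than scikit-learn, so the fit_line helper and the hours/scores data are invented purely for illustration and are not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied vs. exam score, lying exactly on y = 5x + 50.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [55.0, 60.0, 65.0, 70.0, 75.0]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

The fitted slope is the model's compact statement of how much the target moves per unit of the feature. It is a simplified representation of the data, not the data itself.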

But a wondrous realization about these models is that, despite the feeling that Python models are just a sequence of 1s and 0s which cleverly captures gradient-descent-driven loss-function minimization, the process of model building parallels the human activity of prediction and explanation! The human brain, whether making a numerical mental estimate, attempting to verbalize a memory of a physical event, or recognizing the face of an old friend over a few seconds, is picking out isomorphic features of reality in its mental representation. The major difference is that Python models lack sentience: they are still tools that require human understanding to be used effectively. I believe this is the fundamental reason why the job of a data scientist is less about model building in Python, or large-scale distributed computing in AWS, and more about producing insight, visualization, and a broader top-level understanding of a domain.
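For readers curious what gradient-descent-driven loss minimization looks like when stripped down, the following is a rough sketch in plain Python. The gradient_descent function, the learning rate, the step count, and the toy data are all assumptions chosen for illustration; real libraries use far more refined optimizers.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move a small step in the direction that reduces the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The loop starts from an uninformed guess and gradually adjusts its two numbers until the line resembles the data, which is one way to picture a model converging on a representation.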

Interestingly, this is also why data science is such a broad field: the scope of model building is as wide as the range of human tasks there are to simulate. Word embeddings for natural language processing, convolutional models for image recognition, and regression models for numerical prediction are all models of sub-domains of the world which, for most of human history, have been human-generated.

The ultimate data science model, similar to the perfect map in the short story On Exactitude in Science by Jorge Luis Borges, would just be the world itself! But such a model would be, and is, too cumbersome to wield effectively. We make models to simplify the world in an actionable way, and I believe any data scientist would benefit from this perspective.

That said, I suspect the veteran data scientists already have. Thanks for reading!

Translated from: https://towardsdatascience.com/but-what-is-a-model-58c486cbd40a