OpenAI's data was scraped too. Scrape it and it becomes yours? That is damn shameless, beyond belief.

OpenAI’s data, particularly for training large models like GPT-3 and GPT-4, comes from a variety of publicly available and licensed sources. Here’s a breakdown of the key sources OpenAI uses for training its language models:

1. Public Web Data

  • Web Crawls: OpenAI uses a wide range of publicly available web pages, which include websites, blogs, forums, news articles, and other textual content freely available on the internet.
  • Books and Articles: Content from publicly available books, research papers, articles, and other publications.
  • Wikipedia: Wikipedia’s vast amount of knowledge across different topics is often a key resource.
  • Forums and Social Media: OpenAI may use data from platforms such as Reddit or Stack Exchange, among others; data derived from these platforms is typically aggregated and anonymized.

2. Licensed Data

OpenAI may also have access to proprietary data through licensing agreements with certain organizations, such as:

  • News sources: Subscription-based news websites or archives, which provide high-quality content for training.
  • Research Papers: Databases like arXiv or academic publishers where papers are publicly available or licensed for use.

3. Books and Academic Journals

  • OpenAI uses a large corpus of books and academic papers across various domains to give the model a broad knowledge base, particularly in specialized fields like science, technology, literature, history, and more.

4. Code and Programming Resources

  • Models like GPT-4 have been trained on a large corpus of code from open-source platforms like GitHub to better understand and generate code across a variety of programming languages.

5. Other Datasets

OpenAI uses a range of curated datasets, such as:

  • Common Crawl: A massive dataset of web data scraped regularly.
  • Project Gutenberg: A collection of free eBooks, especially classic literature.
  • Open Subtitles: Text data from movie subtitles, which help improve conversational understanding.

 

All replies:

Nothing wrong with that. Knowledge you learn from humanity is yours; once you create something from it, compile it into a book, or write it up as a work, it is copyrighted! -猛牛- (54 bytes) 01/29/2025 22:25:11

If your copyright is infringed, just sue in court. Shouting it here a thousand times does no good. -花點牛牛- (0 bytes) 01/29/2025 22:27:15

Don't rush! Someone is assessing the most appropriate way to proceed. -猛牛- (80 bytes) 01/29/2025 22:29:08

You're the one in a rush. You've been shouting about it since early on, nonstop. -花點牛牛- (0 bytes) 01/29/2025 22:32:52

Check for yourself: my post count today is only a third of yours. -猛牛- (101 bytes) 01/29/2025 22:36:28

Today's plagiarism accusation was started by you. -花點牛牛- (0 bytes) 01/29/2025 22:40:20

I was just the first to repost a link to this forum. -猛牛- (461 bytes) 01/29/2025 22:43:50

The issue is that there is now an 80-point option with no risk and no cost, and a 90-point option that costs a lot of money and might get you caught. -靈山問禪- (30 bytes) 01/29/2025 22:29:28

If you can scrape others, others can scrape you. As I said, data has no copyright; only algorithms do. -鬼眼狂刀- (0 bytes) 01/29/2025 22:29:34

That's not for you to decide! -猛牛- (51 bytes) 01/29/2025 22:30:51

Even if there really is no copyright, surely it's fine to put up barriers so you can't keep stealing? -猛牛- (54 bytes) 01/29/2025 22:33:02

What evidence do you have that they stole? -花點牛牛- (0 bytes) 01/29/2025 22:34:43

Or do you think shouting it here a thousand times counts as evidence? Laughable. -花點牛牛- (0 bytes) 01/29/2025 22:37:45

OpenAI has the evidence! We're just spectators; how would we have any? -猛牛- (107 bytes) 01/29/2025 22:38:38

I also have evidence that my neighbor stole from me, but I can't produce it and don't dare go to court. Everyone just wait, haha. -花點牛牛- (0 bytes) 01/29/2025 22:42:24

They're losing money and getting desperate; they need to get funding out of the 1.6 billion to cover the shortfall. -評論2012- (0 bytes) 01/29/2025 22:26:33

They bought a lot of NVDA? Then they're in trouble. -花點牛牛- (0 bytes) 01/29/2025 22:28:16

A few days ago someone posted Pelosi's trades. Did they follow along and buy NVDA options? -花點牛牛- (0 bytes) 01/29/2025 22:29:23

Reading books and newspapers to learn is a completely different matter from copying someone else's homework. -未知- (125 bytes) 01/29/2025 22:30:32

First prove that OpenAI didn't copy. Then prove that DS didn't do its own homework. -鬼眼狂刀- (0 bytes) 01/29/2025 22:35:04

That's sophistry. Only one thing needs proving right now. -未知- (45 bytes) 01/29/2025 23:14:21

If OpenAI wants to make accusations, it should produce evidence. -花點牛牛- (0 bytes) 01/29/2025 23:20:20

OpenAI probably pays Wikipedia and the like, right? -julie116- (0 bytes) 01/29/2025 22:36:27

No need. -鬼眼狂刀- (734 bytes) 01/29/2025 22:41:23

I hope they can share some of the costs, so Wikipedia doesn't always have to beg so pitifully for donations, which makes one feel bad about not donating. Also, given this kind of open data, DS obtaining data as a paying user seems reasonable too. -julie116- (0 bytes) 01/29/2025 22:54:38

NYtimes vs OpenAI -mobius- (0 bytes) 01/29/2025 22:42:25
