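At 妖妹's request, 虎哥 has modified this program so that its output is exactly the format 妖妹 needs. Don't worry, 妖妹: there is no need to collect the activity posts by hand, which is exhausting work. With this modified Python program, collecting the activity posts is almost fully automatic.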
Copy and paste the code below into a Python file. If you have Python 3 installed on your computer, run the file and follow the prompts; collecting the activity posts is then almost fully automatic.
Congratulations to 妖妹 on the great success of her activity!
# Author: 書香之家版主 nearby, March 2022
#
# Usage of this Python program:
# 0. Make sure that you have Internet access and Python 3 installed on your computer (or use Cloud)!
#    The third-party 'requests' package is also needed (pip install requests).
# 1. Place this file in a folder. Say, in a folder named "wxc"
# 2. Go to your forum (論壇) and search for the title of your activity (活動). You will get one or more pages.
#    Remember how many pages there are.
#    If you do not know how to do this, just skip this step. I will then assume that there are 3 pages
#    (150 entries, which is more than usual)
# 3. Execute this program. You will be prompted for the name of your activity, and
#    the number of pages you obtained in step 2 (if you do not know the number of pages, just hit ENTER)
#    Example:
#      春天的暢想
#      3 (or hit the ENTER key)
# 4. You will also be prompted for your forum's name in English letters. You can look this up in your forum.
#    For example, 書香之家 has the URL https://bbs.wenxuecity.com/sxsj/, so its English name is sxsj.
#    Other examples include: 美語世界 is mysj, 文化走廊 is culture, 詩詞欣賞 is poetry, etc.
# 5. The result is stored inside 'wxc/sxzj-out.html'. You can then copy/paste the source code of
#    'sxzj-out.html' into your WXC new page. Done!
#
# Note: By default the entries are organized in reverse chronological order.
# Should you need them to be placed in chronological order, please do:
# Comment out the statement: mylist.reverse() by placing # in front of it, like: #mylist.reverse()
#

import requests

# Lines containing any of these strings are not activity entries and are skipped.
notargets = ['跟帖', '輸入關鍵詞', '內容查詢', 'input name', '當前', '首頁', '上一頁', '尾頁', '下一頁']
notargets.append('archive')
# This is how SXZJ (書香之家) works. When 無憂 starts an activity, she always marks her activity like this.
notargets.append('##活動##')
# notargets.append('匯總')


def isInside(line, notargets_array):
    # Return True if the line contains any of the strings to be skipped.
    for t in notargets_array:
        if t in line:
            return True
    return False
# END


# the line looks like <a href="/sxsj/76799.html" target="_blank">【<em>春天的暢想</em>】春天屬於女人</a>
# I need it to be <a href="https://bbs.wenxuecity.com/sxsj/76799.html" target="_blank">【<em>春天的暢想</em>】春天屬於女人</a>
def addHttp(line):
    at = line.split('href="')
    line2 = '<a href="https://bbs.wenxuecity.com' + at[1]
    return line2
# END


def processOneFile(target, html, mylist):
    # split the text by newline character to get an array of strings
    lines = html.text.split('\n')
    length = len(lines)
    i = 0
    while i < length:
        line = lines[i]
        if (target in line) and ('href="' in line) and (not isInside(line, notargets)):
            line = addHttp(line)
            print(line)
            # The next line holds the author. It looks like:
            #   [書香之家] - <strong>WXCTEATIME</strong>(6987 bytes )
            # and needs to be WXCTEATIME only.
            i = i + 1
            if i < length:
                parts = lines[i].replace('</strong>', '<strong>').split('<strong>')
                if len(parts) > 1:
                    line += " " + parts[1]
            mylist.append(line)
        i = i + 1
# END of FUNCTIONS


# ---- main starts here ----
print()
print('# Author: 書香之家版主 nearby, March 2022')
print()

target = input('What is the title of your activity (活動)?: ')

pages = 3  # default, means there are maximum 150 entries
temp = input('How many pages are there when you search for the activity in WXC? (If you do not know, just hit ENTER): ').strip()
if temp != '':
    pages = int(temp)

subid = 'sxsj'
temp = input('What is the name of your 論壇 in English? For example, 書香之家 is sxsj, 美語世界 is mysj, 文化走廊 is culture, 詩詞欣賞 is poetry: ').strip()
if len(temp) >= 2:
    subid = temp

mylist = []

# The first page of search results has no page number in its URL.
url = 'https://bbs.wenxuecity.com/bbs/archive.php?SubID=' + subid + '&pos=bbs&keyword=' + target + '&username='
f = requests.get(url)
processOneFile(target, f, mylist)

# The remaining pages.
for i in range(1, pages):
    url = 'https://bbs.wenxuecity.com/bbs/archive.php?page=' + str(i) + '&SubID=' + subid + '&pos=bbs&keyword=' + target + '&username='
    f = requests.get(url)
    processOneFile(target, f, mylist)

mylist.reverse()

# this is the output file.
with open('sxzj-out.html', 'w', encoding='utf-8') as html2:
    for li in mylist:
        html2.write("<p>" + li + "\n")

print("\n")
print(str(len(mylist)) + " entries")
print("\n")
print("Please check the file sxzj-out.html. The result is in it! Thanks for using this program. ---- 虎哥 / Nearby ")
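
If you would like to see what the program does to each entry before running it against the forum, the two string steps can be tried offline on the sample lines quoted in the comments above. This is only a small sketch: `add_http` and `extract_author` are hypothetical names chosen for illustration (they mirror `addHttp` and the `<strong>` handling in `processOneFile`), and the two sample lines are the ones from the comments, not live data.

```python
# Offline sketch of the two string steps; no Internet access is needed.
# add_http and extract_author are illustrative names, not part of the program above.

BASE = "https://bbs.wenxuecity.com"


def add_http(line):
    """Turn the site-relative href into an absolute URL."""
    parts = line.split('href="', 1)
    if len(parts) < 2:
        return line  # no link on this line, leave it unchanged
    return '<a href="' + BASE + parts[1]


def extract_author(line):
    """Return the text between <strong> and </strong>, or '' if there is none."""
    parts = line.replace('</strong>', '<strong>').split('<strong>')
    return parts[1] if len(parts) > 1 else ''


# Sample lines taken from the comments in the program.
title_line = '<a href="/sxsj/76799.html" target="_blank">【<em>春天的暢想</em>】春天屬於女人</a>'
author_line = '[書香之家] - <strong>WXCTEATIME</strong>(6987 bytes )'

# One output entry: the absolute link followed by the author name.
entry = add_http(title_line) + " " + extract_author(author_line)
print(entry)
```

Each line written to sxzj-out.html has this shape, with `<p>` placed in front of it.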