I.T. Spices The LINUX Way
Python In The Shell: The STEEMIT Ecosystem – Post #117
SCRAPING ALL BLOGS USING PYTHON – THE POST BODY
Please refer to Post #110 for the complete Python script and the intro of this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way
In this long post we will discuss how we arrive at the BODY of the blog post. The body contains the TEXTS, the VIDEO link/s and the IMAGE link/s, whichever of these are applicable.
Lines 109 to 147 are laid out below; take note of the VIDEOS, IMAGES and TEXT PARAGRAPHS portions:
109
110 #BODY
111 print('\nBODY:')
112 flogs.write('\nBODY:')
113 par = soup.findAll('p')
114 vid = soup.findAll('div', {'class':'videoWrapper youtube'})
115 img = soup.findAll('img')
116 #FOR YOUTUBE VIDEOS
117 if vid != []:
118     print('\n' + ' VIDEO link/s:')
119     flogs.write('\n' + ' VIDEO link/s:')
120     for v in vid:
121         jpglink = re.search('\(.+?\)', str(v))
122         imglink = jpglink.string[jpglink.start():jpglink.end()]
123         vidlink = os.popen('echo "' + str(imglink) + '" | cut --delimiter=/ -f5').read()
124         print(' https://www.youtube.com/watch?v=' + str(vidlink))
125         flogs.write('\n https://www.youtube.com/watch?v=' + str(vidlink))
126 #FOR IMAGES
127 if img != []:
128     print('\n' + ' IMAGE link/s:')
129     flogs.write('\n' + ' IMAGE link/s:')
130     for i in img:
131         imagelink = re.search('\".+?\"', str(i))
132         imageurl = imagelink.string[imagelink.start():imagelink.end()]
133         print(' ' + str(imageurl))
134         flogs.write('\n ' + str(imageurl))
135 #PARAGRAPHS TEXT
136 if par != []:
137     print('\n' + ' TEXT Paragraphs:')
138     flogs.write('\n' + ' TEXT Paragraphs:')
139     for sss in par:
140         body = (sss.text).replace('\n', '\n ')
141         print(' ' + body)
142         flogs.write('\n ' + body)
143 with open('/dev/shm/steemblogs/samplesoup', 'a+') as f:
145     f.write('\n##############################################################################################################################\n')
146     f.write(soup.prettify())
147     f.write('\n##############################################################################################################################\n')
Lines 111 to 112 print the part that the Python program is currently processing, this time the BODY.
Lines 113 to 115 declare the variables for the three parts of a blog post: the TEXT PARAGRAPHS (par), the VIDEO links (vid) and the IMAGE links (img). Each line uses the BeautifulSoup module to collect every matching tag in the page: p tags for paragraphs, div tags with the class videoWrapper youtube for videos, and img tags for images.
Lines 116 to 125 zoom in on the VIDEOS, starting with an IF statement so that the block only runs when video links were found; there is no need to waste our time on something that returns zero results.
Further to the VIDEO portion, take note of line 121. This Python line uses the re module to look for the first string of text enclosed in ( ) characters, and line 122 keeps the matched text for further processing. The VIDEO id sits inside that string, disguised as part of an image file link. Line 123 cuts the string at every / character and keeps the fifth field, which is the id; we can then reconstruct the full URL of the video by combining the fixed characters “https://www.youtube.com/watch?v=” with the captured id.
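The same id extraction can also be sketched in pure Python, without calling the shell. This is only an illustration and not part of the original script: the function name youtube_url_from_wrapper and the sample div markup are assumptions made for the example, and a real Steemit page may be laid out differently.

```python
import re

def youtube_url_from_wrapper(wrapper_html):
    """Build a watch URL from the thumbnail link inside wrapper_html, or return None."""
    # Same pattern as line 121: the shortest run of text enclosed in ( and ).
    match = re.search(r'\(.+?\)', wrapper_html)
    if match is None:
        return None
    thumb = match.group(0)      # the match still includes the ( and ) characters
    parts = thumb.split('/')
    if len(parts) < 5:
        return None
    video_id = parts[4]         # fifth field, like `cut --delimiter=/ -f5`
    return 'https://www.youtube.com/watch?v=' + video_id

# Hypothetical markup, assumed for illustration only.
sample = ('<div class="videoWrapper youtube" '
          'style="background-image: url(https://img.youtube.com/vi/abc123XYZ_0/0.jpg);"></div>')
print(youtube_url_from_wrapper(sample))
```

Splitting inside Python avoids starting a shell for every video, and the returned id carries no trailing newline the way the output of a shell command does.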
Lines 126 to 134 deal with the IMAGE links contained in the BODY, if applicable. The IF statement at line 127 sees to it that we only spend time on images if the img variable found at least one match.
Further to the IMAGE portion, take note of line 131, where the Python re module searches for the first text string enclosed in “” characters. If one is found, line 132 keeps that string, quotes included, and the program treats it as the image link.
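To see what the pattern at line 131 picks up, here is a small sketch. The helper names first_quoted_value and src_value, and the sample img tags, are assumptions for illustration only. It also shows a possible pitfall: the pattern returns the first quoted value in the tag, so if another attribute happens to come before src, that attribute is what gets captured.

```python
import re

def first_quoted_value(tag_html):
    """Same idea as lines 131 to 132: the first "..." run, quotes included."""
    match = re.search(r'".+?"', tag_html)
    return match.group(0) if match else None

def src_value(tag_html):
    """A stricter variant (an assumption, not the original code): only the src attribute."""
    match = re.search(r'src="(.+?)"', tag_html)
    return match.group(1) if match else None

# Hypothetical tags, assumed for illustration only.
plain = '<img src="https://example.com/pic.png"/>'
with_alt = '<img alt="cat" src="https://example.com/pic.png"/>'

print(first_quoted_value(plain))
print(first_quoted_value(with_alt))
print(src_value(with_alt))
```

Whether the stricter variant is needed depends on how the img tags in the scraped pages are actually written.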
Lines 135 to 142 deal with the PARAGRAPHS OF TEXT that one writes in a blog post. Since not all blogs contain paragraphs of words, the IF statement at line 136 once again makes sure we only go further if there is a positive result. Lines 143 to 147 then append the prettified soup to the file /dev/shm/steemblogs/samplesoup, between two separator lines of # characters.
Further to the PARAGRAPHS portion, take note of line 140. The loop at line 139 walks through the par variable one element at a time, naming each element sss; par was declared earlier at line 113 as the list of p tags found by BeautifulSoup. The .text of each element is the plain wording of one paragraph, and the replace call puts the same leading space after every line break so that a multi-line paragraph stays aligned.
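The effect of the replace call at line 140 can be tried on a plain string. This is a minimal sketch: the function name indent_paragraph and the two-space prefix are assumptions for the example, and the original script may use a different indent width.

```python
def indent_paragraph(text, prefix='  '):
    """Put prefix in front of every line of text, the way lines 140 to 141 do."""
    # Replacing each line break with "line break + prefix" indents lines 2..n;
    # the leading prefix takes care of line 1.
    return prefix + text.replace('\n', '\n' + prefix)

paragraph = 'First line of the paragraph\nSecond line of the paragraph'
print(indent_paragraph(paragraph))
```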
Take note that if no data is found in any of these portions (VIDEO, IMAGE or PARAGRAPHS), the code is arranged to not display that portion at all, not even its title line.
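The "skip it when empty" arrangement repeats three times in the listing, so it could be folded into one small helper. The sketch below is only one possible way to do it: the name write_section is hypothetical, and io.StringIO stands in for the flogs log file of the original script.

```python
import io

def write_section(out, title, items):
    """Write title and items to out, or write nothing at all if items is empty."""
    if not items:
        return False            # no title line either, as in the original IF checks
    out.write('\n ' + title)
    for item in items:
        out.write('\n  ' + item)
    return True

log = io.StringIO()             # stand-in for the flogs file object
write_section(log, 'VIDEO link/s:', [])
write_section(log, 'IMAGE link/s:', ['https://example.com/pic.png'])
print(log.getvalue())
```

Here the empty VIDEO list leaves no trace in the log, while the IMAGE section appears with its title line.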
Cleaner, better, much faster. I think that’s the best way to sort out Post Office letters of old, just my take.
“Destiny Is If One Finds A Letter For Him Among Thousands.”