Analyze robots.txt with Python Standard Library

2023-09-24
If I hadn’t searched for both “python” and “robots.txt” in the same input box, I would never have known that the Python Standard Library can parse robots.txt with urllib.robotparser. But the official documentation of urllib.robotparser doesn’t go into detail. With the documentation, you can check whether a URL can be fetched by a robot with robot_parser_inst.can_fetch(user_agent, url) if you are building a crawler bot yourself. But if you want to do some statistics about robots. Continue reading
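
As a minimal sketch of the can_fetch check mentioned above, the snippet below parses an in-memory robots.txt instead of downloading one. The rules, the user agent names ("MyCrawler", "BadBot") and the example.com URLs are made up for illustration and do not come from the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# A generic crawler falls back to the "*" rules.
allowed_public = parser.can_fetch("MyCrawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "https://example.com/private/page")

# BadBot is disallowed everywhere by its own entry.
allowed_badbot = parser.can_fetch("BadBot", "https://example.com/index.html")

# crawl_delay() returns the Crawl-delay value that applies to the user agent.
delay = parser.crawl_delay("MyCrawler")

print(allowed_public, allowed_private, allowed_badbot, delay)
```

For a live site, set_url() followed by read() would replace the parse() call.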

Process AWS Kinesis Firehose data with Python

2023-09-11
Pipelining streamed data events directly into Amazon S3 via the AWS Kinesis Firehose service is convenient and efficient. And we can use Python with boto3 to consume the data directly from S3. This allows for seamless storage of your data, ensuring its integration and accessibility. Mostly, we are dealing with JSON-formatted event logs. But there is one tiny stone in the shoe for logs fed from AWS Kinesis Firehose: there is no newline between consecutive log entries. Continue reading
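
One possible way to split such back-to-back JSON entries, not necessarily the approach the post takes, is json.JSONDecoder.raw_decode, which decodes one value and reports where it ended. In this sketch the helper name iter_json_objects and the sample events are hypothetical, and a plain string stands in for the body of an S3 object.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Decode one value starting at pos; the returned index is where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical payload: three events with no newline between them.
raw = '{"event": "click", "id": 1}{"event": "view", "id": 2}{"event": "click", "id": 3}'
events = list(iter_json_objects(raw))
print(events)
```

The helper reads the whole payload into memory, which is fine for small objects; very large files would call for a streaming variant.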

The curious cases of json_extract

2020-04-25
json_extract is a function for extracting data from a JSON document in both MySQL and MariaDB. The function normally works fine, for example:

    set @j = '{"num": 42, "list": [1, 2, 3], "obj": {"name": "Edward Stark"}}';
    select json_extract(@j, '$.num') as num,
           json_extract(@j, '$.list') as list,
           json_extract(@j, '$.obj') as obj;

    +------+-----------+--------------------------+
    | num  | list      | obj                      |
    +------+-----------+--------------------------+
    | 42   | [1, 2, 3] | {"name": "Edward Stark"} |
    +------+-----------+--------------------------+

But when it comes to a single JSON string, the results of json_extract are not always as expected. Continue reading

Contextvars and Thread local

2019-04-12
In this post, I will share some examples of contextvars (new in Python 3.7) and thread local.

Default Value

At the module level, calling ContextVar.set, or directly using setattr on a thread local variable, won’t successfully set a default value: the value set won’t take effect in another thread. To ensure a default value, for contextvars:

    import contextvars

    context_var = contextvars.ContextVar("context_var", default=0)

For thread local, a sub class of thread. Continue reading
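
The default-value behaviour described above can be observed with a short sketch. The names local_data, worker and results are illustrative only. The ContextVar is given its default through the constructor, while the thread-local attribute is set at module level by the main thread.

```python
import contextvars
import threading

# Default supplied through the constructor: available in every thread.
context_var = contextvars.ContextVar("context_var", default=0)

# Attribute set at module level: only the thread that ran this line sees it.
local_data = threading.local()
local_data.value = 0

results = {}


def worker():
    # No set() has been called, so get() falls back to the default.
    results["context_var"] = context_var.get()
    # The main thread's attribute does not exist in this thread.
    results["local_has_value"] = hasattr(local_data, "value")


thread = threading.Thread(target=worker)
thread.start()
thread.join()

print(results)
```

The main thread still sees local_data.value, which is why the missing default is easy to overlook until a second thread reads it.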

Mongodb(via MongoEngine) join query with aggregate

2017-01-09
Since MongoDB 3.2 and MongoEngine 0.9, we can use the aggregate command (with its $lookup stage) to perform join queries on multiple collections in a database. This post is a simple tutorial for join queries on MongoDB (via MongoEngine in Python) with examples.

Models Setup

Let’s consider models defined as below:

    import random

    import mongoengine


    class User(mongoengine.Document):
        meta = {"indexes": ['rnd']}
        name = mongoengine.StringField()
        rnd = mongoengine.FloatField(default=random.random)


    class Group(mongoengine.Document):
        meta = {"indexes": ['rnd']}
        name = mongoengine.

Continue reading